[zfs-discuss] ZFS on top of iSCSI on top of ZFS

Chris Siebenmann cks at cs.toronto.edu
Fri Jan 8 11:13:57 EST 2016


> >> Did you already try out a setup like this?
> >> Are there any known problems when running ZFS on top of iSCSI on top
> >> of ZFS?
> > 
> >  I'm curious why you're running ZFS on the iSCSI backends, instead of
> > exporting the raw disks to the ZFS frontends. ZFS supports pools of
> > multiple mirrors.
> > 
> > (While our fileserver setup does not use ZFS on Linux or ZFS on our
> > iSCSI backends, we do have a quite similar setup; Linux iSCSI backends
> > exporting raw disks to OmniOS frontends that have ZFS pools of mirrored
> > disks; each side of a mirror comes from a different backend.)
> 
> Hello Chris,
> 
> thanks for your suggestion.
> 
> I see the following disadvantages with that approach:
> * All disks are visible in the head nodes. In our case that would be 80
> disks. This increases complexity and may lead to confusion during setup
> / maintenance.
> * We use multipathing with 2 paths for each system. On Linux multipath
> devices automatically get device names like mpath01 to mpath80. You can
> add an alias for the WWN of each iSCSI disk though, but this will blow
> up your multipath.conf and may lead to more confusion.
> * I assume that there will be massive network and CPU overhead for
> iSCSI communication with 80 (single path) or 160 (2 paths) connections
> compared to only 4 (single path) or 8 (2 paths) connections.
> 
> How many disks do you use in your setup?

 A typically configured fileserver sees 112 iSCSI 'disks' and our
largest fileserver currently has 84 of them in use. (The 'disks' are
actually ~450 GB chunks of the backend disks; each backend has 14 data
disks currently.)

 OmniOS multipathing generates essentially opaque (and rather long)
device names. We keep things straight with a layer of locally written
commands that map between the long native device names and friendly
names like 'helsinki:disk02/0', both for generating low-level ZFS
commands and for displaying pool status and so on.
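 As an illustration only (the script, the table file, and its format
here are made up for this message, not our actual tooling), the core of
such a mapping can be as small as a lookup table plus a tiny resolver:

    #!/bin/sh
    # Hypothetical sketch: resolve a friendly name like
    # 'helsinki:disk02/0' to the long native device name, using a
    # whitespace-separated table with lines of
    # "<friendly-name> <native-device>".
    map=/etc/local/diskmap
    awk -v want="$1" '$1 == want { print $2 }' "$map"

Everything else (building low-level zpool command lines, reporting pool
status with friendly names) can then funnel through the same table in
both directions.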

(We feel happier with cover scripts in general, because they let
us completely avoid certain terrible mistakes like 'zpool add pool
/some/disk1 /some/disk2' (where you left out the 'mirror' and, well,
oops).)
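 To spell out the difference (the pool and device names are just the
placeholders from above):

    # What you meant: add one new mirror vdev to the pool.
    zpool add pool mirror /some/disk1 /some/disk2

    # What the typo does: add two separate single-disk, non-redundant
    # top-level vdevs.  zpool warns about the mismatched replication
    # level, but a reflexive -f overrides that, and you cannot simply
    # take the disks back out of the pool again afterwards.
    zpool add pool /some/disk1 /some/disk2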

 In your environment, I would create something that auto-generates
the appropriate alias names in multipath.conf and so on.
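 As a sketch of what I mean (the table file, its format, and the alias
scheme are assumptions here, not something we actually run), a short
shell loop can emit the multipaths{} stanzas for you:

    #!/bin/sh
    # Hypothetical sketch: generate multipath.conf 'multipaths' stanzas
    # from a table with lines of "<wwid> <alias>".
    table=/etc/local/iscsi-wwids
    echo "multipaths {"
    while read -r wwid alias; do
        printf '    multipath {\n        wwid  %s\n        alias %s\n    }\n' \
            "$wwid" "$alias"
    done < "$table"
    echo "}"

You then splice the output into multipath.conf instead of maintaining
80 stanzas by hand.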

 As far as network overhead goes, I would strongly suggest measuring
it. I would actually expect no real overall overhead compared to having raidzN
pools on your backends, because you are essentially writing the same
amount of data from the frontends to the backends either way. While
iSCSI has some overhead, I don't believe it has that much, especially
for large IO requests. You may want to look at some iSCSI parameter
tuning, though; there are options you can probably set to reduce the
iSCSI overhead a bit.

(Note that multipathing will not add any extra network volume here.)
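 If your initiators are Linux open-iscsi, the sort of knobs I have in
mind are the negotiated segment and burst sizes; purely as an
illustration (the target name is a placeholder and the values are not
recommendations for your hardware):

    # Raise the negotiated segment/burst sizes for one target, then log
    # out and back in (iscsiadm -u / -l) so they get renegotiated.
    iscsiadm -m node -T iqn.2016-01.example:backend1 -o update \
        -n 'node.conn[0].iscsi.MaxRecvDataSegmentLength' -v 262144
    iscsiadm -m node -T iqn.2016-01.example:backend1 -o update \
        -n node.session.iscsi.FirstBurstLength -v 262144
    iscsiadm -m node -T iqn.2016-01.example:backend1 -o update \
        -n node.session.iscsi.MaxBurstLength -v 16776192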

 If you are doing purely mirroring between (raw) backend disks on
the frontends, ZFS will not normally split a decent-sized IO between
multiple disks on a single backend the way that raidzN IO gets
fragmented.  If you write a 128 KB block to a pool with a lot of vdevs
(of mirrored disks), ZFS will just write the block to one vdev, which
turns into one write to some disk on each backend (and thus ideally
one iSCSI write request and response on the network).
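 You can watch this yourself while a streaming write is running;
something like the following (the pool name and file path are
placeholders) shows whether whole blocks land on single vdevs or get
split up:

    # Watch per-vdev and per-disk request counts and sizes while a
    # streaming write is in progress.
    dd if=/dev/zero of=/tank/streamtest bs=128k count=80000 &
    zpool iostat -v tank 5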

 Note that raidzN pools on the iSCSI backends will sharply limit your
random IO rate for data that cannot be prefetched. Each ZFS raidzN vdev
generally delivers roughly the random IO rate of a *single* disk, no
matter how many disks are in it.
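
 If you want to see the difference directly, a small random-read test
against both layouts makes it obvious; the fio invocation below is only
illustrative (adjust the file path, and make --size comfortably larger
than your ARC so reads actually hit the disks):

    # Run against a file on a raidzN pool and again on a pool of
    # mirrors, then compare the reported read IOPS.
    fio --name=randread --filename=/tank/fiotest --rw=randread \
        --bs=8k --ioengine=libaio --iodepth=32 --numjobs=4 \
        --size=100G --runtime=120 --time_based --group_reporting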

> Did you compare different configurations?

 We haven't done so, partly because maximum performance is not our
primary criterion. We established relatively early on that we wanted
ZFS to handle redundancy, and at the time of our initial design (in
roughly 2007) that meant doing it on the then-Solaris fileservers.

If you want to read more about our setup and my views on iSCSI
tuning:
 https://utcc.utoronto.ca/~cks/space/blog/tech/LikelyISCSITuning
 https://utcc.utoronto.ca/~cks/space/blog/tech/UnderstandingiSCSIProtocol
 https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSFileserverSetupII

 In terms of 10G on Linux, our experience in testing was that it was
not difficult to get quite high data rates (although we did use jumbo
frames). OmniOS fileservers to Linux backends could do impressive IO
rates over 10G (well, before the OmniOS driver fell over with our cards
and we had to switch to 1G for stability).  As far as we could tell we
were generally saturating the actual backend disks before we ran out of
10G network bandwidth unless we had a *lot* of disks in a single pool.

(It sounds like you're going to have lots of disks in a pool, so your
aggregate data rate may be enough to saturate 2x 10G. This is a good
problem to have.)
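
(Rough arithmetic, assuming a conventional spinning disk can stream
something like 150 MB/s: 80 disks could in principle source on the
order of 80 * 150 MB/s = 12,000 MB/s, while 2x 10G tops out around
2 * 1,200 MB/s = 2,400 MB/s, so for streaming IO the network becomes
the ceiling long before the disks do.)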

 Taken from another of your messages:

> With a 128 KB blocksize, compression disabled on both sides and a
> single raidz2 vdev (of disks), I get for a 100 GB file sequentially
> 6000-6400 IOPS, 750-800 MB/s for writing and 1000-2500 IOPS, 130-330
> MB/s for reading.

 A pool of mirrored vdevs of the raw disks should be able to do *much*
better than this for reading. I'm honestly surprised that you don't do
better than this even when talking to a raidzN pool.

 You may want to use something like ttcp to test and measure your raw
10G network performance to make sure that you're not running into some
fundamental limit or misconfiguration.

(There is a whole hierarchy of tests we did, starting with ttcp
performance, then iSCSI performance to and from a single disk just
with dd, and so on up the stack.)
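 For concreteness, the bottom two layers of that hierarchy look
something like this (host and device names are placeholders; the write
test overwrites the LUN, so only point it at a scratch disk):

    # Raw TCP throughput first (classic ttcp; iperf works just as well):
    #   on the backend:   ttcp -r -s
    #   on the frontend:  ttcp -t -s -n 100000 backend1
    #
    # Then a single iSCSI disk with dd, bypassing ZFS entirely.
    # WARNING: the write test destroys whatever is on that LUN.
    dd if=/dev/mapper/mpatha of=/dev/null bs=1M count=20000 iflag=direct
    dd if=/dev/zero of=/dev/mapper/mpatha bs=1M count=20000 oflag=direct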

	- cks
