[zfs-discuss] ZFS on top of iSCSI on top of ZFS

kaazoo kaazoo at kernelpanik.net
Fri Jan 8 05:59:08 EST 2016

Am 07.01.16 um 16:25 schrieb Markus Koeberl:
> On Thursday 07 January 2016 09:51:41 kaazoo via zfs-discuss wrote:
>> In order to have reliable storage inside each storage appliance I use
>> ZFS to replace RAID. The 16 disk appliances use a single raidz2 pool and
>> the 24 disk appliances use a single raidz3 pool.
> If I understand it right, you are using a single vdev (14+2 and 21+3), so you get very poor write performance and want to benefit from 10 GBit, I guess.
> I never used such a setup, but I think you should use a pool with at least 8 mirrors for the 16 disk appliances.
> How is the backup managed, ZFS send/receive? In that case maybe 3 x 6+2 for the backup.

Well, we need the capacity, and the 8 TB disks are a bit expensive, so
we don't want to use too many disks for redundancy, especially since we
also mirror 2 storage appliances. That way we can even lose a whole
appliance without losing data.
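For reference, the raw-vs-usable numbers behind that trade-off can be
sketched like this (assuming 8 TB disks and ignoring ZFS metadata and
slop overhead):

```python
# Back-of-the-envelope usable capacity for the two layouts above,
# assuming 8 TB disks; ZFS metadata/slop overhead is ignored.
DISK_TB = 8

def usable_tb(disks, parity):
    """Data capacity of a single raidz vdev with the given parity level."""
    return (disks - parity) * DISK_TB

print(usable_tb(16, 2))  # 16-disk raidz2: 14 data disks -> 112 TB
print(usable_tb(24, 3))  # 24-disk raidz3: 21 data disks -> 168 TB
# Mirroring two appliances halves the total usable capacity again.
```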

We have 4 storage appliances in total, everything connected with 2
paths to a 10 GBit switch. Naively calculating, the head nodes have 20
GBit of bandwidth for talking to a mirror of 2 storage appliances during
normal use. This would give 10 Gbit of bandwidth for talking to each
appliance. When doing snapshots or while scrubbing, the head nodes may
have to talk to all 4 storage appliances at the same time. This would
give 5 Gbit of bandwidth for talking to each appliance.
In reality the usable bandwidth with multiple iSCSI
connections/transfers at the same time on 10 Gbit Ethernet will probably
be much lower, even with jumbo frames and so on.

Following this calculation, the optimum would be between 500 and
1000 MB/s for reading from and writing to a single appliance.
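As a quick sanity check on that range, converting the per-appliance link
shares from Gbit/s to MB/s (decimal units, 8 bits per byte, ignoring
iSCSI/TCP overhead):

```python
# Convert the per-appliance link-share estimates above from Gbit/s to MB/s.
# Decimal units (1 Gbit = 1e9 bits); protocol overhead is ignored.
def gbit_to_mb_per_s(gbit):
    return gbit * 1e9 / 8 / 1e6  # bits/s -> bytes/s -> MB/s

print(gbit_to_mb_per_s(10))  # 1250.0 MB/s per appliance in normal use
print(gbit_to_mb_per_s(5))   # 625.0 MB/s per appliance while scrubbing
# So 500-1000 MB/s per appliance is roughly the useful target range.
```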

With a 128 KB blocksize and compression disabled on both sides,
sequential access to a 100 GB file gives:

Single raidz2 vdev (16 disks):
write: 6000-6400 IOPS, 750-800 MB/s
read: 1000-2500 IOPS, 130-330 MB/s

2 raidz2 vdevs (8 disks each):
write: 6000-6700 IOPS, 760-840 MB/s
read: 1100-2700 IOPS, 150-340 MB/s

Single raidz3 vdev (16 disks):
write: 4600-6400 IOPS, 580-800 MB/s
read: 1200-2500 IOPS, 140-320 MB/s

I haven't tested 3x raidz2 or 2x raidz3 so far.
I haven't tested a mirrored setup of 2 appliances yet, because I only
have one system at the moment. Reads should improve a lot, as ZFS reads
from all mirror members in parallel. Writes should be about the same as
for a single appliance.

The backup cronjobs are using ZFS send/receive.

>> The front nodes then access all 4 storage appliances via iSCSI and use
>> them for 2 mirrored zpools (primary and backup).
>> The iSCSI volumes have 128 KB blocksize, so ashift=17 is used.
>> ZFS features like compression and snapshots are used here.
> If I understand right how compression works, you may not benefit from these settings.
> Because of ashift=17 you cannot use blocks smaller than 128 KB, and because of the max blocksize of 128 KB you never get bigger blocks.
> So the only way to save space is if you get 128 KB of zeros, I guess.
> I would suggest comparing it with ashift=12 to see if you get better results for compression and similar performance.

I think you are right. I did some testing with different zvol blocksizes
and a single raidz3 vdev. ashift was 9 (512 Byte) on the storage
appliance. Perhaps I should repeat the tests with ashift=12.
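The compression point above can be illustrated with the allocation
rounding: in a simplified model, a compressed record allocates its
compressed size rounded up to a multiple of 2^ashift (all-zero records
become holes and allocate nothing), so with ashift=17 a 128 KB record
can never shrink:

```python
import math

# How much space a compressed 128 KB record allocates, given ashift.
# Simplified model: compressed size rounded up to a multiple of
# 2^ashift; all-zero records become holes and allocate nothing.
def allocated(compressed_bytes, ashift):
    if compressed_bytes == 0:
        return 0
    sector = 1 << ashift
    return math.ceil(compressed_bytes / sector) * sector

compressed = 60 * 1024  # say a 128 KB record compresses to 60 KB

print(allocated(compressed, 17))  # 131072 -> no savings with ashift=17
print(allocated(compressed, 12))  # 61440  -> ~53% saved with ashift=12
```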

4 KB zvol blocksize:
write: 1200-2000 IOPS, 150-220 MB/s
read: 600-800 IOPS, 50-90 MB/s

8 KB zvol blocksize:
write: 1700-2200 IOPS, 220-300 MB/s
read: 300-900 IOPS, 40-115 MB/s

16 KB zvol blocksize:
write: 3600-4200 IOPS, 460-530 MB/s
read: 1000-1400 IOPS, 140-170 MB/s

32 KB zvol blocksize:
write: 4800-5400 IOPS, 600-690 MB/s
read: 1400-1800 IOPS, 180-230 MB/s

64 KB zvol blocksize:
write: 5500-6000 IOPS, 700-770 MB/s
read: 1100-2000 IOPS, 140-260 MB/s

128 KB zvol blocksize:
write: 5800-6300 IOPS, 730-780 MB/s
read: 1000-2400 IOPS, 130-300 MB/s

> What's the blocksize of the iSCSI target? I guess 4 KB, so why use a different one for ZFS on top of it? What's the advantage of not using ashift=12?

At the moment, iSCSI uses a logical blocksize of 512 Byte and reports a
physical blocksize of 128 KB:
[sde] 197568495616 512-byte logical blocks: (101 TB/92.0 TiB)
[sde] 131072-byte physical blocks
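Those numbers are internally consistent: the block count times 512 bytes
matches the reported 101 TB / 92.0 TiB, and 131072 bytes is exactly
2^17, which is where the ashift=17 comes from:

```python
# Check the kernel's reported sizes for the iSCSI LUN above.
blocks = 197568495616
logical = 512

total = blocks * logical
print(round(total / 1e12, 1))   # 101.2 (decimal TB, reported as "101 TB")
print(round(total / 2**40, 1))  # 92.0 (binary TiB)
print(131072 == 2**17)          # True: 128 KB physical blocks -> ashift=17
```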

I will test performance with 4 KB logical blocksize for iSCSI again.

> Which blocksize is Samba using? I guess you will get the best performance if you align everything to the Samba blocksize. If you force ZFS to a bigger one, you get a heavily fragmented pool over time.

My testing didn't involve Samba so far.

In the smb.conf the following is configured:
aio read size = 16384
aio write size = 16384

From the manpage, the parameter 'block size' defaults to 1024 Bytes.
Do you have experience with changing these Samba parameters for ZFS?

