[zfs-discuss] ZFS on Large SSD Arrays

Niels de Carpentier zfs at decarpentier.com
Tue Oct 29 16:44:08 EDT 2013


> I have been doing some testing on large SSD arrays.  The configuration is:
>
> Xeon E5-1650
> 24x 128GB "consumer" SSDs
> 64GB RAM
>
> This system is fast.  At q=200 the drives will do 4K reads at > 800,000
> IOPS.  Doing linear writes, a raid-0 array can hit 6GB/sec.
>
> I am testing this with a 300GB target ZVOL using a test program that does
> O_DIRECT aligned random block IO.  Compression and de-dupe are both turned
> off.
>
> In general, reads seem to saturate out just under 300,000 IOPS at q=200.
> This is not that surprising and can be attributed to SHA overhead.
>
> My concern is writes.  One is that write IOPS are "pedestrian" for this
> size of array at under 80,000 IOPS.  80K IOPS sounds like a lot, but the
> array is capable of a lot more than this.  This 80K also is quite variable
> and is often a lot lower, depending on the pre-conditioning of the target
> volume.  On a particular test at q=10, random writes proceeded at 17,676
> IOPS or 69.06MB/sec.  This was a "100% default" zpool running raid-0.
>
> My bigger concern is what is happening to the drives underneath.  During
> this test above, I watched the underlying devices with 'iostat' and they
> were doing 365.27MB/sec of "actual writes" at the drive level.  This is a
> "wear amplification" of more than 5:1.  For SSDs wear amplification is
> important because it directly leads to flash wear out.

Likely the ashift was automatically set to 13 (8K), which causes each 4K
block to be written as 8K. Make sure to specify ashift=12 when creating
the pool. Also, sync writes are by default written to the ZIL first (the
main pool is used if no dedicated log device is present), so that doubles
the writes as well. Set the zvol's logbias property to throughput to
avoid this double write.
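
For example (a rough sketch only; "tank", the device names and
"tank/vol1" are placeholders for your actual pool, disks and zvol):

  # create the pool with 4K sectors forced
  zpool create -o ashift=12 tank sdb sdc sdd

  # verify the ashift actually in use
  zdb -C tank | grep ashift

  # write sync data straight to the pool instead of doubling it via the ZIL
  zfs set logbias=throughput tank/vol1
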
>
> Just for fun, i re-ran the above tests with the zpool configured at
> raidz3.  With triple parity raid, the wear amplification jumped to 23:1.

Yes, with 4K blocks a raidz3 vdev is essentially a 4-way mirror: each 4K
data block needs three parity sectors, so roughly 4 times the data gets
written. Striped mirrors will give you redundancy with much better
performance.
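
For example (sketch only; pool and device names are placeholders), the
same drives laid out as striped 2-way mirrors:

  # striped 2-way mirrors with 4K sectors; extend the pattern to all
  # 12 pairs of the 24-drive array
  zpool create -o ashift=12 tank \
      mirror sdb sdc \
      mirror sdd sde \
      mirror sdf sdg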

>
> This testing implies that ZFS is just not designed for pure SSD array
> usage, at least with large arrays.  This array is 3TB, but I have
> customers running pure SSD arrays as large as 48TB, and the trend
> implies that PB SSD arrays are just around the corner.  I suspect there
> is some tuning that can help (setting the block size lower seems to
> help some), but I would like to understand more of what is going on
> before I jump to extreme conclusions (although 23:1 wear amplification
> is pretty extreme).  Eventually, my testing will become a paper, so if
> I am off base, I would like to not embarrass myself.
>
> Comments on tuning for pure SSD arrays would be very welcome.  I have
> seen some posts that imply the same issues have been seen elsewhere.
> For example, someone tried a pair of Fusion-IO cards and was stuck at
> 400MB/sec on writes.  This setup is larger/faster than a pair of Fusion
> cards.  Again, any help would be appreciated.

Increasing zfs_vdev_max_pending fixed the issue for the Fusion-IO cards,
but I'm not sure if that can be set for zvols.
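
If you want to experiment with it, on zfsonlinux it is a module
parameter (assuming a version that still exposes zfs_vdev_max_pending;
the value 32 below is just an example):

  # check the current per-vdev queue depth
  cat /sys/module/zfs/parameters/zfs_vdev_max_pending

  # raise it at runtime
  echo 32 > /sys/module/zfs/parameters/zfs_vdev_max_pending

  # or make it persistent across module reloads
  echo "options zfs zfs_vdev_max_pending=32" >> /etc/modprobe.d/zfs.conf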

Niels


