[zfs-discuss] zfs performance

Phil Harman phil.harman at gmail.com
Fri Feb 19 06:44:39 EST 2016

Actually, no.

Yes, 5400 rpm does, indeed, equal 90 rps (5400 / 60). However, any random read will only have to wait half a rotation on average. This means that a 5400 rpm drive has an average rotational latency of 5.6 ms (1 / 90 / 2). With no buffering or seeks this would equate to 180 reads/s.
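For completeness, here's that arithmetic as a trivial C snippet (a sketch only; the constants are my own illustration, not WD's published specs):

#include <stdio.h>

/* Back-of-envelope rotational latency for a given spindle speed.
 * Ignores seek and transfer time entirely. */
int main(void)
{
    double rpm        = 5400.0;
    double rps        = rpm / 60.0;            /* 90 revolutions/s        */
    double avg_rot_ms = 1000.0 / rps / 2.0;    /* ~5.6 ms: half a turn    */
    double reads_s    = 1000.0 / avg_rot_ms;   /* ~180 reads/s, zero seek */

    printf("rps=%.0f  avg rotational latency=%.1f ms  reads/s (no seek)=%.0f\n",
           rps, avg_rot_ms, reads_s);
    return 0;
}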

However, we mustn't ignore seek latency. WD seems very coy about publishing the physical characteristics of their drives (e.g. they don't even specify the rpm of my 3TB Red drives). Perhaps this is due to widespread misunderstanding of such metrics?

The thing is, whilst average rotational latency remains a constant, seek latency depends on stride. The head mechanism has mass, so Newton's laws come into play in terms of the time and energy required to accelerate and decelerate the heads in order to get them roughly where they need to be. And then there's the time and effort required to get the head exactly where it needs to be and to wait for it to stop wobbling.

So, whereas it's simple mathematics that defines average rotational latency, it's the laws of physics that dominate seek latency. And, as any Starfleet wrench monkey will tell you, "ye cannae change the laws of physics". However, over the past few decades, we computer scientists have done a lot of work to get around them :)

It's not just rpm that differentiates 15k and 5.4k drives, it's also size (all the 15k drives I've seen are 2.5", whereas my Reds are 3.5").

But let's be generous and assume the total average read latency (rotation plus seek) of the WD Red is 11ms. Well, that just puts us back at square one: your claim of 90 reads/s (though you arrived at it by luck rather than by that reasoning).

However, the firmware in even a modern consumer-grade 5.4k drive is much less a matter of luck - it's a thing of wonder. Not only does it know where the heads are at all times, it also knows the rotational position of the platters. And if you give it a queue of enough work to do, it can begin to defeat both seek and rotational latency.
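To make that concrete, here's a rough sketch of the kind of queue-depth test I mean (not my actual benchmark; it assumes Linux-style O_DIRECT and pthreads, and the device path, span and thread count are purely illustrative):

/* Rough sketch of a queue-depth random read test. Each thread keeps
 * one 4 KB read in flight, so NTHR threads approximate a queue depth
 * of NTHR at the drive. Time it externally, e.g. with iostat. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLK   4096UL               /* 4 KB reads                  */
#define SPAN  (300ULL << 30)       /* short-seek to first 300 GB  */
#define NTHR  16                   /* effective queue depth       */
#define NREAD 1000                 /* reads per thread            */

static const char *dev = "/dev/sdb";   /* illustrative device path */

static void *worker(void *arg)
{
    int fd = open(dev, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return NULL; }

    void *buf;
    if (posix_memalign(&buf, BLK, BLK)) { close(fd); return NULL; }

    unsigned seed = (unsigned)(long)arg;
    for (int i = 0; i < NREAD; i++) {
        /* pick a random, block-aligned offset within the span */
        unsigned long long blk = (((unsigned long long)rand_r(&seed) << 16)
                                  ^ rand_r(&seed)) % (SPAN / BLK);
        if (pread(fd, buf, BLK, (off_t)(blk * BLK)) != (ssize_t)BLK)
            perror("pread");
    }
    free(buf);
    close(fd);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHR];
    for (long i = 0; i < NTHR; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHR; i++)
        pthread_join(t[i], NULL);
    return 0;
}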

No, my numbers are good. I've been doing storage microbenchmarks for over 25 years, and I've worked with some of the best systems engineers in industry (including the inventors of ZFS).

It's important to check your numbers. For me, it's a habit. I am confident of my IOPS numbers and of the quality of the random number generator I'm using to generate seek offsets. Before even seeing this thread, I output all my pseudo random block offsets, sorted them, and counted them. I know I'm getting whole drive coverage, and very few block revisits.
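As an illustration (not my actual harness), that sanity check can be as simple as replaying the offset generator and counting distinct blocks versus revisits:

#include <stdio.h>
#include <stdlib.h>

/* Coverage sanity check: generate the same pseudo-random block numbers
 * a benchmark would use and count distinct blocks vs revisits over the
 * test range. Range and read counts here are illustrative. */
int main(void)
{
    const unsigned long long nblocks = 1000000;   /* blocks in the range */
    const unsigned long long nreads  = 100000;    /* offsets to generate */
    unsigned char *hit = calloc(nblocks, 1);
    if (!hit) return 1;

    srand(1);
    unsigned long long distinct = 0, revisits = 0;
    for (unsigned long long i = 0; i < nreads; i++) {
        unsigned long long b = (((unsigned long long)rand() << 16) ^ rand())
                               % nblocks;
        if (hit[b]) revisits++;
        else { hit[b] = 1; distinct++; }
    }
    printf("%llu offsets: %llu distinct blocks, %llu revisits\n",
           nreads, distinct, revisits);
    free(hit);
    return 0;
}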

Again, do the maths. Credit where credit is due: we may not know exactly how the WD Red firmware tackles latency, but it demonstrably does so. Maybe that's why they don't want their drives to be measured in simple rpm and seek latency terms alone?

From my raw disk experiments, I only expect 200 reads/s with plenty of threads doing relatively nearby seeks (my test spanned a range of 300 GB - i.e. just 10% of the whole).

Yes, I'm sure my ZFS random read test benefits from proximity (even though my pool is 2/3 full, it's only about 25% fragmented), and prefetching by the drives (though with primarycache=metadata, I have verified that ZFS does no file-level prefetch ... though for small blocks it may be doing some vdev prefetch).

However, the point remains that I would expect more from the OP's disks.


> On 19 Feb 2016, at 09:28, Gordan Bobic via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:
> What you are seeing is the effect of prefetching and hardware caching.
> 5400 rpm = 90 rps, therefore the data you are looking for only comes under the head 90 times per second. This is a fundamental, hard upper limit on random seeking, and statistical worst case is half that (e.g. head cannot move fast enough to intercept the required sector at the next location so has to wait for the disc to go all the way around again).
> Disks will often transparently prefetch by over-reading, i.e. if the disk looks at its seek queue and determines that it doesn't have to start moving the head for another millisecond or two, it can read another millisecond or two off the current cylinder, and cache it, just in case. Also, if the disk notices there's already a read queued for the same cylinder it'll order them so it can read them in a single seek. In the former case, if your seeks aren't that random, a subsequent seek might just score a cache hit. In the latter case, you have coalesced two seeks into one (and you might get luckier than 2x in either case).
> Bottom line, the reason you are seeing more IOPS than the RPS of your disk is because your seek pattern isn't actually random. And how random the pattern is will depend entirely on the nature of your application. Don't assume more IOPS than RPS without having measured it under a range of load conditions specific to your application, and different fill rates of your pool.
>> On Fri, Feb 19, 2016 at 2:23 AM, Phil Harman via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:
>> Erm … my WD Red 3 TB drives (~5400 RPM) can hit over 200 random 4 KB reads/second … with directio (i.e. no cache) … if I short-seek to just 10% of the drive (i.e. the first 300 GB) and fill the queues to a depth of 12+ … it’s not just about caching … most drives also do elevator sorting etc.
>> My point is that 32 GB is tiny … if I short-seek to just 32 GB on one drive, I can get 200 reads/second with just 5 threads … and with 16 threads I get >300 reads/second.
>> I have a pool made up of 5x WD Red 3 TB drives in a RAIDZ3 (yes, that’s more parity than data ... because it’s my data) … just for kicks, I've created a ZFS filesystem with the primarycache=metadata option and built a 32 GB file from /dev/urandom … with 32 threads I get 500 reads/second from ZFS, and 1000 reads/second from the Red (i.e. 200 IOPS per drive).
>> Using DTrace, my zfs_read() latency distribution over one minute looks like this …
>>   ZFS READ (us)
>>            value  ------------- Distribution ------------- count
>>                4 |                                         0
>>                8 |                                         4
>>               16 |                                         116
>>               32 |                                         0
>>               64 |                                         0
>>              128 |                                         0
>>              256 |                                         12
>>              512 |                                         4
>>             1024 |                                         1
>>             2048 |                                         6
>>             4096 |                                         188
>>             8192 |@@@                                      2624
>>            16384 |@@@@@@@@                                 6168
>>            32768 |@@@@@@@@@@@@@@                           10274
>>            65536 |@@@@@@@@@@@                              8736
>>           131072 |@@@                                      2139
>>           262144 |                                         120
>>           524288 |                                         0
>> Those very few that took less than 4096 microseconds are just background filesystem activity in the same pool.
>> Keeping the workload running, I set primarycache=all. After just 10 minutes (600 seconds) I’m getting about 5000 reads/second from ZFS, still about 1000 reads/second from the Reds, and about 1500 reads/second from my L2ARC.
>> My zfs_read() latency distribution now looks like this …
>>   ZFS READ (us)
>>            value  ------------- Distribution ------------- count
>>                1 |                                         0
>>                2 |                                         7
>>                4 |                                         16
>>                8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              235600
>>               16 |                                         2249
>>               32 |                                         251
>>               64 |                                         6
>>              128 |                                         7
>>              256 |                                         9
>>              512 |@@@@@@@@                                 72920
>>             1024 |@@                                       13324
>>             2048 |                                         482
>>             4096 |                                         266
>>             8192 |                                         3661
>>            16384 |@                                        7892
>>            32768 |@                                        9212
>>            65536 |@                                        5595
>>           131072 |                                         1176
>>           262144 |                                         55
>>           524288 |                                         0
>> This rather nicely shows the three distinct latency bands of the ARC (8-16 microseconds peak), L2ARC (512-1024 microseconds peak) and RAIDZ vdev (32-64 milliseconds peak).
>> The high proportion of ARC hits concurs with the arcstat, which shows an ARC size of 21 GB (i.e. about 2/3 of the 32 GB dataset … if you do the maths, you’ll see that the 8-16 microseconds peak accounts for about 67% of all reads).
>> So Chris, either you have something horribly wrong with your hardware, or your test could be better. I’ve never used fio (I tend to write my own benchmarks in C, because then I can be really sure about what they are doing).
>> Full disclosure: this was SmartOS, and I’m using an LSI 9207-8i HBA (it’s just the kit I have ready to hand). However, I’ve done a lot of performance work with ZFS on Debian 8 in the last few weeks, and have no reason to believe that things would be any different had I used ZoL.
>> Hope this helps,
>> Phil
>>> On 18 Feb 2016, at 22:10, Cédric Lemarchand via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:
>>> Hum … a single 7.2k drive could deliver an average of 100 IOPS; what methodology is used to obtain those results? From my point of view it is almost impossible to get 200 IOPS with a random 4k workload on those kinds of drives without some caching mechanism involved somewhere.
>>> What does “iostat -x 3” say, particularly the %util counter, during the bench?
>>> From my personal experience, even though I love ZFS, every IOPS benchmark I have done has been really disappointing.
>>>> On 18 Feb 2016, at 22:14, cvb--- via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:
>>>> Thanks for summarizing these hints. Very useful. And of course I do know about additional features of zfs, which is why I'm looking to migrate to zfs.
>>>> I played around with the pool, and I'm still wondering what iops performance (in terms of the fio benchmark) I could roughly expect from 6 HGST 7200 spindles. The system has 64GB RAM, but for this test I have restricted ARC to 16GB with a 32GB test file size, as I don't want to measure RAM performance... I know a lot depends on tuning, and I'm just looking for a general guidance. Because even with atime=off, and sync=disabled I did not get beyond 150 iops in a fio 4k benchmark, with or without compression, and with different recordsizes. A single HGST has 216 iops in the same test.
>>>> fio --rw=randrw --name=4k_test --directory=/mnt/zfs/test --size=32G --refill_buffers --time_based --runtime=300s --ramp_time=10s --ioengine=libaio --iodepth=64 --bs=4k
>>>> I googled for other benchmarks, but I couldn't find any for just 6 drives... Just trying to understand if I'm doing anything wrong.
>>>> Thanks, Chris