[zfs-discuss] zfs performance

Gordan Bobic gordan.bobic at gmail.com
Fri Feb 19 04:28:44 EST 2016


What you are seeing is the effect of prefetching and hardware caching.
5400 rpm = 90 rps, therefore the data you are looking for only comes under
the head 90 times per second. This is a fundamental, hard upper limit on
random seeking, and the statistical worst case is half that (i.e. the head
cannot move fast enough to intercept the required sector at the next
location, so it has to wait for the disc to go all the way around again).
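
For concreteness, the back-of-envelope arithmetic behind those figures (a rough
sketch, rotational delay only, ignoring seek time):

    echo "5400 / 60" | bc            # 90 revolutions per second, so at most 90 chances/sec
    echo "scale=1; 1000 / 90" | bc   # ~11.1 ms per rotation; missing the sector costs a second
                                     # rotation, ~22 ms per read, hence the ~45/sec worst case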

Disks will often transparently prefetch by over-reading, i.e. if the disk
looks at its seek queue and determines that it doesn't have to start moving
the head for another millisecond or two, it can read another millisecond or
two off the current cylinder and cache it, just in case. Also, if the disk
notices there's already a read queued for the same cylinder, it'll order
them so it can read them in a single seek. In the former case, if your seeks
aren't that random, a subsequent seek might just score a cache hit. In the
latter case, you have coalesced two seeks into one (and in either case the
benefit can turn out to be better than 2x).

Bottom line, the reason you are seeing more IOPS than the RPS of your disk
is because your seek pattern isn't actually random. And how random the
pattern is will depend entirely on the nature of your application. Don't
assume more IOPS than RPS without having measured it under a range of load
conditions specific to your application, and different fill rates of your
pool.
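
As a rough sketch of what "measuring it" can look like with fio (the device name
and the spans are placeholders; --readonly keeps fio from writing to the device):

    # sweep queue depth and the portion of the device touched
    for depth in 1 4 16 64; do
      for span in 32G 300G; do
        fio --name=randread_sweep --filename=/dev/sdX --readonly --direct=1 \
            --rw=randread --bs=4k --size=$span --ioengine=libaio \
            --iodepth=$depth --runtime=60 --time_based
      done
    done

The wider the span and the shallower the queue, the closer the numbers fall back
towards the raw rotational/seek limit.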



On Fri, Feb 19, 2016 at 2:23 AM, Phil Harman via zfs-discuss <
zfs-discuss at list.zfsonlinux.org> wrote:

> Erm … my WD Red 3 TB drives (~5400 RPM) can hit over 200 random 4 KB
> reads/second … with directio (i.e. no cache) … if I short-seek to just 10%
> of the drive (i.e. the first 300 GB) and fill the queues to a depth of 12+
> … it’s not just about caching … most drives also do elevator sorting etc.
>
> My point is that 32 GB is tiny … if I short-seek to just 32 GB on one
> drive, I can get 200 reads/second with just 5 threads … and with 16 threads
> I get >300 reads/second.
>
> I have a pool made up of 5x WD Red 3 TB drives in a RAIDZ3 (yes, that’s
> more parity than data ... because it’s my data) … just for kicks, I've
> created a ZFS filesystem with the primarycache=metadata option and built a
> 32 GB file from /dev/urandom … with 32 threads I get 500 reads/second from
> ZFS, and 1000 reads/second from the Red (i.e. 200 IOPS per drive).
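>
> (The setup was roughly the following; the pool/dataset name and the dd sizing
> are illustrative:)
>
>   zfs create -o primarycache=metadata tank/cachetest    # "tank/cachetest" is an illustrative name
>   dd if=/dev/urandom of=/tank/cachetest/random.dat bs=1M count=32768   # ~32 GB of incompressible data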
>
> Using DTrace, my zfs_read() latency distribution over one minute looks like
> this …
>
>   ZFS READ (us)
>            value  ------------- Distribution ------------- count
>                4 |                                         0
>                8 |                                         4
>               16 |                                         116
>               32 |                                         0
>               64 |                                         0
>              128 |                                         0
>              256 |                                         12
>              512 |                                         4
>             1024 |                                         1
>             2048 |                                         6
>             4096 |                                         188
>             8192 |@@@                                      2624
>            16384 |@@@@@@@@                                 6168
>            32768 |@@@@@@@@@@@@@@                           10274
>            65536 |@@@@@@@@@@@                              8736
>           131072 |@@@                                      2139
>           262144 |                                         120
>           524288 |                                         0
>
> Those very few that took less than 4096 microseconds are just background
> filesystem activity in the same pool.
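>
> (The histogram is a quantize() aggregation over zfs_read(); a minimal sketch of
> the sort of one-liner that produces it, assuming the fbt provider exposes
> zfs_read on your kernel:)
>
>   dtrace -n '
>       fbt::zfs_read:entry  { self->ts = timestamp; }
>       fbt::zfs_read:return /self->ts/ {
>           @["ZFS READ (us)"] = quantize((timestamp - self->ts) / 1000);
>           self->ts = 0;
>       }
>       tick-60s { exit(0); }'   # assumes fbt::zfs_read probes exist on this kernel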
>
> Keeping the workload running, I set primarycache=all. After just 10
> minutes (600 seconds) I’m getting about 5000 reads/second from ZFS, still
> about 1000 reads/second from the Reds, and about 1500 reads/second from my
> L2ARC.
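>
> (The switch is just a property change on the same illustrative dataset;
> arcstat and zpool iostat -v show where the reads are being served from:)
>
>   zfs set primarycache=all tank/cachetest   # same illustrative dataset name as above
>   arcstat 5                                 # ARC size and hit rate over time
>   zpool iostat -v 5                         # per-vdev reads: the Reds vs the cache device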
>
> My zfs_read() latency distribution now looks like this …
>
>   ZFS READ (us)
>            value  ------------- Distribution ------------- count
>                1 |                                         0
>                2 |                                         7
>                4 |                                         16
>                8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              235600
>               16 |                                         2249
>               32 |                                         251
>               64 |                                         6
>              128 |                                         7
>              256 |                                         9
>              512 |@@@@@@@@                                 72920
>             1024 |@@                                       13324
>             2048 |                                         482
>             4096 |                                         266
>             8192 |                                         3661
>            16384 |@                                        7892
>            32768 |@                                        9212
>            65536 |@                                        5595
>           131072 |                                         1176
>           262144 |                                         55
>           524288 |                                         0
>
> This rather nicely shows the three distinct latency bands of the ARC (8-16
> microseconds peak), L2ARC (512-1024 microseconds peak) and RAIDZ vdev
> (32-64 milliseconds peak).
>
> The high proportion of ARC hits concurs with the arcstat, which shows an
> ARC size of 21 GB (i.e. about 2/3 of the 32 GB dataset … if you do the
> maths, you’ll see that the 8-16 microseconds peak accounts for about 67% of
> all reads).
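>
> (Concretely: the 8-16 microsecond bucket holds 235,600 of the ~352,700 samples
> above, i.e. roughly 67%.)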
>
> So Chris, either you have something horribly wrong with your hardware, or
> your test could be better. I’ve never used fio (I tend to write my own
> benchmarks in C, because then I can be really sure about what they are
> doing).
>
> Full disclosure: this was SmartOS, and I’m using an LSI 9207-8i HBA (it’s
> just the kit I have ready to hand). However, I’ve done a lot of performance
> work with ZFS on Debian 8 in the last few weeks, and have no reason to
> believe that things would be any different had I used ZoL.
>
> Hope this helps,
> Phil
>
> On 18 Feb 2016, at 22:10, Cédric Lemarchand via zfs-discuss <
> zfs-discuss at list.zfsonlinux.org> wrote:
>
> Hmm … a single 7.2k drive can deliver an average of about 100 IOPS; what
> methodology is used to obtain those results? From my point of view it is
> almost impossible to get 200 IOPS with a random 4k workload on those kinds
> of drives without some caching mechanism involved somewhere.
>
> What does “iostat -x 3” say during the benchmark, particularly the %util
> column?
>
> From my personal experience, even though I love ZFS, every IOPS benchmark I
> have done has been really disappointing.
>
>
> On 18 Feb 2016, at 22:14, cvb--- via zfs-discuss <
> zfs-discuss at list.zfsonlinux.org> wrote:
>
> Thanks for summarizing these hints. Very useful. And of course I do know
> about additional features of zfs, which is why I'm looking to migrate to
> zfs.
>
> I played around with the pool, and I'm still wondering what iops
> performance (in terms of the fio benchmark) I could roughly expect from 6
> HGST 7200 spindles. The system has 64GB RAM, but for this test I have
> restricted ARC to 16GB with a 32GB test file size, as I don't want to
> measure RAM performance... I know a lot depends on tuning, and I'm just
> looking for general guidance. Because even with atime=off and
> sync=disabled I did not get beyond 150 iops in a fio 4k benchmark, with or
> without compression, and with different recordsizes. A single HGST has 216
> iops in the same test.
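>
> (For reference, capping the ARC at 16GB on ZFS on Linux is typically done via
> the zfs_arc_max module parameter; a sketch, with the value in bytes for 16 GiB:)
>
>   echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max   # 16 GiB; assumes the ZoL zfs_arc_max tunable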
>
> fio --rw=randrw --name=4k_test --directory=/mnt/zfs/test --size=32G \
>     --refill_buffers --time_based --runtime=300s --ramp_time=10s \
>     --ioengine=libaio --iodepth=64 --bs=4k
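>
> (A read-only variant of the same job, sketched here with the same path and a
> new job name, would isolate the random-read side of that number:)
>
>   fio --rw=randread --name=4k_read_test --directory=/mnt/zfs/test --size=32G \
>       --time_based --runtime=300s --ramp_time=10s \
>       --ioengine=libaio --iodepth=64 --bs=4k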
>
> I googled for other benchmarks, but I couldn't find any for just 6
> drives... Just trying to understand if I'm doing anything wrong.
>
> Thanks, Chris