[zfs-discuss] About zvol performance drop

Phil Harman phil.harman at gmail.com
Thu May 12 07:24:22 EDT 2016


Daobang,

I'm not sure where to start. It's usually better to try changing one thing at a time :)

Why the switch to 4KB writes? That's going to limit your throughput even more (the limiting factor is the number of IOPS each drive can do, so the bigger the IO, the better ... but you've made it smaller).
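
To put rough numbers on it (using the ~500 random IOPS per drive estimate from earlier in this thread, and an arbitrary 64KB block size purely for comparison):

    500 IOPS x 4KB  = ~2 MB/s per drive
    500 IOPS x 64KB = ~31 MB/s per drive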

Why sync=always? It is not surprising that you get such poor performance (now, every write from the client will have to wait for the data and metadata to sync to disk). The default is sync=standard, which with your SCST target configured for writeback should allow the application to decide when to sync (though, because you're using fio's -direct option, it will still sync every write).
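
For example (placeholder names - "POOL00/vol01" stands in for whatever your zvol is actually called):

    zfs get sync POOL00/vol01           # check the current setting
    zfs set sync=standard POOL00/vol01  # back to the default behaviour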

If you want to see more throughput with this configuration and workload (i.e. sync=always and 4KB random synchronous writes), you need to run more jobs in parallel (-numjobs). But you'll also need to increase the queue depth (-iodepth) and/or the number of exported LUNs if you want to see more than 32 IOPS in flight at any one time.
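
Something along these lines, as a sketch rather than a tuned recipe (note that fio's psync engine issues one IO at a time per job, so -iodepth only takes effect with an asynchronous engine such as libaio):

    fio -filename=/dev/sdl -direct=1 -ioengine=libaio -iodepth=32 -numjobs=8 \
        -rw=randwrite -bs=4K -size=64G -runtime=300 -time_based \
        -group_reporting -name=4K_randwrite_parallel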

However, you really need to stop and ask "what am I trying to achieve?" My previous comments were not meant to help you design the perfect system, but to help you understand the performance you were seeing (i.e. that it was as expected).

What does your real-world workload look like? What's the IO profile in terms of size/alignment, read/write, random/sequential, sync/async/flush, number of threads, etc.? Maybe you will need an SLOG/ZIL device. Maybe you will need some L2ARC. Maybe you should be considering SSDs, or perhaps nearline drives would meet your needs.
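
If a log and/or cache device does turn out to make sense for your workload, adding them is straightforward (device names below are placeholders):

    zpool add POOL00 log mirror /dev/ssd0 /dev/ssd1   # dedicated SLOG for sync writes
    zpool add POOL00 cache /dev/ssd2                  # L2ARC read cache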

Phil



> On 12 May 2016, at 11:11, Daobang Wang <wangdbang at 163.com> wrote:
> 
> Hello,
> 
>     According to your suggestion, I set up a new system (8 x mirror vdevs) to verify the performance. It's still poor - is there any further suggestion? The results follow:
> 
> 1. sync=always
>                capacity      operations    bandwidth
> pool        alloc   free    read  write   read  write
> ----------  -----  -----    -----  -----  -----  -----
> POOL00      2003M  4552748M   0    277      0     2M
>   mirror     259M  569084M    0     34      0   279K
>     sdf         -      -      0     34      0   279K
>     sdg         -      -      0     34      0   279K
>   mirror     262M  569081M    0     34      0   279K
>     sdd         -      -      0     34      0   279K
>     sde         -      -      0     34      0   279K
>   mirror     257M  569086M    0     34      0   279K
>     sdb         -      -      0     37      0   291K
>     sdc         -      -      0     34      0   279K
>   mirror     262M  569081M    0     34      0   279K
>     sdl         -      -      0     34      0   279K
>     sdm         -      -      0     34      0   279K
>   mirror     240M  569103M    0     34      0   279K
>     sdj         -      -      0     34      0   279K
>     sdk         -      -      0     38      0   295K
>   mirror     253M  569090M    0     33      0   271K
>     sdh         -      -      0     33      0   271K
>     sdi         -      -      0     33      0   271K
>   mirror     241M  569102M    0     33      0   271K
>     sdp         -      -      0     33      0   271K
>     sdq         -      -      0     33      0   271K
>   mirror     225M  569118M    0     34      0   279K
>     sdn         -      -      0     33      0   271K
>     sdo         -      -      0     34      0   279K
> ----------  -----  -----  -----  -----  -----  -----
> 
> fio -filename=/dev/sdl -direct=1 -iodepth=32 -thread -numjobs=1 -ioengine=psync -rw=randwrite -bs=4K -size=64G -group_reporting -name=64G_fio4K_SW
> 64G_fio4K_SW: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=32
> fio-2.1.4
> Starting 1 thread
> Jobs: 1 (f=1): [w] [1.4% done] [0KB/980KB/0KB /s] [0/245/0 iops] [eta 05h:23m:51s]
> 
> 
>                capacity        operations    bandwidth
> pool        alloc   free      read  write   read  write
> ----------  -----  -----     -----  -----  -----  -----
> POOL00      25696M  4529055M   296     1K    16M   105M
>   mirror    3228M  566115M     40    237     2M    14M
>     sdf         -      -       25    122     1M    14M
>     sdg         -      -       14    123   958K    14M
>   mirror    3223M  566120M     51    249     3M    15M
>     sdd         -      -       15    129  1022K    15M
>     sde         -      -       35    130     2M    15M
>   mirror    3211M  566132M     31    163     1M     9M
>     sdb         -      -       18    144     1M     9M
>     sdc         -      -       12    242   711K    15M
>   mirror    3221M  566122M     31    281     1M    16M
>     sdl         -      -       19    148     1M    16M
>     sdm         -      -       11    151   647K    16M
>   mirror    3199M  566144M     31    243     1M    11M
>     sdj         -      -       22    165     1M    16M
>     sdk         -      -       8     115   395K    11M
>   mirror    3202M  566141M     30    244     1M    11M
>     sdh         -      -       11    120   647K    11M
>     sdi         -      -       18    120     1M    11M
>   mirror    3224M  566119M     40    310     2M    15M
>     sdp         -      -       12    157   531K    15M
>     sdq         -      -       27    158     1M    15M
>   mirror    3186M  566157M     35    183     2M    11M
>     sdn         -      -       23    138     1M    16M
>     sdo         -      -       11     92   766K    11M
> 
> 
> fio -filename=/dev/sdl -direct=1 -iodepth=32 -thread -numjobs=1 -ioengine=psync -rw=randwrite -bs=4K -size=64G -group_reporting -name=64G_fio4K_SW
> 64G_fio4K_SW: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=32
> fio-2.1.4
> Starting 1 thread
> Jobs: 1 (f=1): [w] [8.1% done] [0KB/3476KB/0KB /s] [0/869/0 iops] [eta 46m:33s]
> 
> Best Regards,
> Daobang Wang.
> 
> At 2016-05-11 18:20:34, "Phil Harman" <phil.harman at gmail.com> wrote:
> >Of course, a fast log device is always a great idea when you want both performance and to keep your data.
> >
> >In this case a SLOG/ZIL device will be of zero benefit, because the OP has gone to great lengths to make sync writes a non-issue (i.e. sync=disabled and writeback).
> >
> >But even a log won't save you from eventually needing to write the data back to disk. At some point as the ARC fills up, you'll hit write throttling.
> >
> >With 8 drives in RAIDZ1 config and assuming the default zvol 8KB volblocksize and default ashift=9...
> >
> >ZFS will split one 8KB random write into 7 data + 1 parity writes of 3 sectors (1.5KB) per drive (you can check this with iostat).
> >
> >We know they're fast drives (in spinning rust terms), but we haven't been told their size. Let's guess at 600GB, in which case a 500GB zvol would span about 500/7/600 - i.e. about 12% of the drive.
> >
> >So, we might expect about 500-600 random IOPS per drive, and as we know each zvol 8KB random write will generate 1 write per drive, we might expect between 4-5MB / sec throughput once write throttling kicks in.
> >
> >Why do we get less than this? Because we haven't accounted for metadata.
> >
> >In conclusion, 3MB / sec is not unreasonable.
> >
> >Moral: don't use RAIDZ when you do a lot of small random IO.
> >
> >You could speed things up (and waste a whole lot of space) by using ashift=12.
> >
> >Best option is mirroring. But you'll still see only 10-12MB / sec with such small writes once write throttling kicks in.
> >
> >> On 11 May 2016, at 07:47, Sebastian Oswald via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:
> >> 
> >> Hello,
> >> 
> >> Just to make sure: the disks are directly attached to the system by
> >> a "dumb" HBA, not via a RAID controller as single-disk RAID0 or any
> >> other funky configuration? RAID controllers can't cope with how ZFS
> >> handles the disks and may/will "do funny things" (though not funny for
> >> performance or your data...).
> >> Also, was the pool created with a 4K block size (ashift=12)?
> >> 
> >> ZFS gets its speed from spreading I/O over all available VDEVs. Major
> >> rule of thumb: the more VDEVs the more performance (at reasonable
> >> numbers of VDEVs...)
> >> RAIDZs are a tradeoff between usable space, redundancy and speed -
> >> with priorities descending in this order.
> >> For high (random) I/O applications like VM storage you should definitely
> >> consider another disk layout. Maximum performance would be 4 x 2
> >> mirroring, which also gives the best flexibility (you can add 2 disks at
> >> a time), but only 50% usable space.
> >> A good tradeoff with usable space is 2xRAIDz1 with 4 disks each. You
> >> should benchmark both layouts and decide based on your requirements.
> >> 
> >> Another important point is the L2ARC and SLOG. For high IOPS, add an
> >> SSD-backed L2 cache and SLOG. This gives by far the biggest performance
> >> boost for ZFS, especially when using it as a storage provider for
> >> multiple systems with relatively low memory on the storage system. 32GB
> >> is relatively low in ZFS terms - always try to throw as much RAM at ZFS
> >> as technically and financially possible.
> >> For the SSD cache/SLOG, make sure to only use proper server-grade SSDs,
> >> as consumer SSDs will be hammered to death within a few months (I killed
> >> 2 cheap 60GB SSDs in a test system within 3 months...). SATA/SAS SSDs
> >> are fine; NVMe or PCIe are much better.
> >> 
> >> The backing spinny-disk layout should be improved anyway, because ZFS
> >> throttles writes when the SLOG grows too large too quickly for the VDEVs
> >> to keep up.
> >> This behaviour is tuneable, but in almost every case the defaults are
> >> perfectly fine and shouldn't be touched! Making things worse by tuning
> >> these parameters is far easier and more common than actually gaining any
> >> improvement.
> >> 
> >> In short:
> >> 1. change your disk layout
> >> 2. add SSD-backed L2ARC and SLOG
> >> 3. ....
> >> 4. profit
> >> 
> >> 
> >> Regards,
> >> Sebastian Oswald
> >> 
> >> 
> >>> Hi All,
> >>> 
> >>>    I set up a system (32GB DDR3) with 8 SAS disks (15,000 RPM), created
> >>> a raidz1 pool (sync disabled) and one 500GB zvol (sync disabled), and
> >>> exported the zvol via a QLE2562 FC HBA with SCST 3.1.0 (writeback,
> >>> fileio). The client runs CentOS 6.5 x86_64. The test command was "fio
> >>> -filename=/dev/sdb -direct=1 iodepth=32 -thread -numjob=1
> >>> -ioengine=psync -rw=randwrite -bs=8k -size=64G -group_reporting
> >>> -name=fio_8k_64GB_RW", run repeatedly. At the start, the speed was about
> >>> 260MB/s (iostat -xm 1), but after several runs the performance dropped
> >>> to 3MB/s.
> >>> 
> >>>    Would anybody give me a clue? Where is the root cause? How to
> >>> improve it?
> >>> 
> >>> Thank you very much.
> >>> 
> >>> Best Regards,
> >>> Daobang Wang.
> >> _______________________________________________
> >> zfs-discuss mailing list
> >> zfs-discuss at list.zfsonlinux.org
> >> http://list.zfsonlinux.org/cgi-bin/mailman/listinfo/zfs-discuss
> 
> 
>  
