[zfs-discuss] ZFS zvols for KVM virtual machines storage

Gionatan Danti g.danti at assyoma.it
Thu Dec 28 06:16:30 EST 2017

On 28/12/2017 09:56, Gena Makhomed via zfs-discuss wrote:
> On 28.12.2017 8:51, Richard Elling via zfs-discuss wrote:
>> Finally, in general, it is a bad idea to use recordsize or 
>> volblocksize = 4k, even if the physical_block_size = 4k.
>> The overhead can be worse than the read/modify/write penalty for 
>> larger recordsize and nvme devices are rarely
>> bandwidth limited. YMMV, so test if you can.
> Why? Which overhead are you talking about?
> I use ZFS zvols for KVM storage on a server;
> the virtual machines use a 4k block size to read/write data.
> So I create all zvols for virtual machines with volblocksize = 4k:
> # zfs create -s -b 4K -V 1024G tank/vm-example-com
> With volblocksize = 128k there would be huge write amplification,
> slowing down the whole server.
> The ZFS pool is a mirror of two 4TB HDDs:
> # zpool create -o ashift=12 -O compression=lz4 -O atime=off -O xattr=sa 
> -O acltype=posixacl tank mirror 
> ata-HGST_HUS724040ALA640_PN2331PBG32EMT-part4 
> ata-HGST_HUS724040ALA640_PN2334PBH1TR5R-part4

I second Richard's advice to not use 4k volblocksize/recordsize. I've 
benchmarked it on both HDD and SSD and 4k performed significantly slower 
than an 8k volume.

This was especially true for HDDs, whose performance at small (< 128KB) 
transfers is dominated by seek time rather than throughput. As an 
example, let's imagine a 1k write to a 4k volume; the write will be 
performed on a file whose metadata is cached by the ARC, while the raw 
data is not found in the cache. A mechanical HDD (with a transfer rate 
of ~ 100MB/s) would do the following:
- seek to find the required 4k block (12 ms);
- read the entire 4k block (0.04 ms);
- compute the new checksum;
- seek to the location where the new 4k block will be written (12 ms);
- write the new 4k block (0.04 ms).

As you can see, the raw block transfer time is vanishingly small 
compared to the seek latency. With a 128k volblocksize, you are going 
to change the 2x 0.04ms transfers into 2x ~1ms ones, hardly a 
game-changer.
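
To put numbers on the steps above, here is a minimal sketch of the 
arithmetic. It assumes the same round figures used in this post (12 ms 
per seek, ~100MB/s transfer rate) and ignores checksum time, caching and 
queueing; the function names are mine, for illustration only.

```python
SEEK_MS = 12.0      # seek time per operation, as assumed in the post
RATE_MB_S = 100.0   # sequential transfer rate, as assumed in the post


def transfer_ms(block_bytes, rate_mb_s=RATE_MB_S):
    """Time to move one block at the given rate, in milliseconds."""
    return block_bytes / (rate_mb_s * 1_000_000) * 1000


def rmw_ms(block_bytes, seek_ms=SEEK_MS, rate_mb_s=RATE_MB_S):
    """One read-modify-write cycle: seek + read, then seek + write."""
    return 2 * (seek_ms + transfer_ms(block_bytes, rate_mb_s))


for label, size in (("4k", 4 * 1024), ("128k", 128 * 1024)):
    print(f"{label}: transfer {transfer_ms(size):.2f} ms, "
          f"read-modify-write total {rmw_ms(size):.2f} ms")
```

Under these assumptions the whole cycle goes from roughly 24 ms to 
roughly 27 ms when moving from 4k to 128k blocks: the two seeks account 
for almost all of it in both cases.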

You can argue that a 4k volume will mostly receive aligned, full-block 
4k writes, skipping the read-modify-write penalty completely - and for 
some workloads, you are right. If you are *sure* your workload is a pure 
4k one, then go with a 4k volblocksize/recordsize. However, if your 
workload is a mixed one (i.e. with many unaligned, partial-block 
writes) I recommend against a 4k volume.
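
As a rough illustration of what "aligned, full-block" means here, the 
following sketch checks whether a single write covers only whole blocks. 
It is a simplified model, not how ZFS itself decides anything: 
needs_rmw is a hypothetical helper, and it only looks at the offset and 
length of one write.

```python
def needs_rmw(offset, length, block_size):
    """Return True if the write touches at least one block only partially.

    A write avoids read-modify-write only when it starts on a block
    boundary and its length is a whole number of blocks.
    """
    if length <= 0:
        return False
    return offset % block_size != 0 or length % block_size != 0


# An aligned 4k write on a 4k volume covers exactly one full block.
print(needs_rmw(0, 4096, 4096))      # False
# A 1k write on a 4k volume modifies only part of a block.
print(needs_rmw(0, 1024, 4096))      # True
# A 4k write shifted by 2k straddles two blocks, both partially.
print(needs_rmw(2048, 4096, 4096))   # True
# The same aligned 4k write is a partial write on an 8k volume.
print(needs_rmw(0, 4096, 8192))      # True
```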

After all, there is a reason why ZFS defaults to a rather large 128k 
recordsize. ZVOLs defaulted to 8k volblocksize because legacy 
filesystems had the tendency to write many small data blocks, and using 
a large volblocksize for an iSCSI-exported ZVOL means much more wire 
traffic than necessary. But *why* 8k? I suppose the default stems 
from 8k being the standard, and smallest, page size on UltraSPARC CPUs.

For SSDs, which are generally not latency bound, things are somewhat 
different. They respond much better to a smaller 
recordsize/volblocksize, and the lower write amplification enables them 
to last longer. For an all-SSD volume/filesystem, I would probably 
choose 8k as volblocksize/recordsize.

On the other hand, I would - again - avoid 4k, for the following reasons:
- performance is lower. I think this is related to 4k receiving much 
less optimization in the original code, as UltraSPARC was really 
targeted at 8k memory pages. Moreover, at least in the old days, ZoL 
itself was much slower with such a small block size;
- compression would be greatly reduced, if not almost non-functional. As 
you should really use ashift=12 when dealing with SSDs, and this imposes 
a 4k minimum block size, a 4k volblocksize/recordsize can not be 
compressed any further. The only exception is for 100% zeroed blocks, 
which can be marked as "empty". Note: in this case, you should run with 
ZLE rather than LZ4, as the former is faster and the latter does not buy 
you anything more;
- more blocks means more memory for metadata tracking, and less memory 
for ARC/L2ARC caching. This even affects the CPU power required to 
"drive" the filesystem itself. In other words, a 4k volume incurs a 
much higher overhead than a larger (even 8k) one.
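
The compression and block-count points can be sketched with a little 
arithmetic. This is a deliberately simplified model: it only rounds the 
compressed size up to the ashift sector size and treats an all-zero 
block as taking no space, ignoring whatever other rules ZFS applies when 
deciding to keep a compressed block. The function names are 
hypothetical, and the 1024G volume size is taken from the zfs create 
command quoted above.

```python
def allocated_bytes(compressed_bytes, logical_bytes, ashift=12):
    """On-disk size of one block when space comes in 2**ashift sectors."""
    sector = 1 << ashift
    if compressed_bytes == 0:
        return 0  # fully zeroed block, stored as "empty"
    # Round the compressed size up to the next whole sector.
    rounded = -(-compressed_bytes // sector) * sector
    # A block never takes more space than its uncompressed size.
    return min(rounded, logical_bytes)


def block_count(volume_bytes, block_bytes):
    """Number of blocks needed to map a fully written volume."""
    return volume_bytes // block_bytes


# A block that compresses 2:1 saves nothing at 4k, but halves at 8k.
print(allocated_bytes(2048, 4096))   # 4096
print(allocated_bytes(4096, 8192))   # 4096

# Blocks to track for a fully written 1024G volume.
volume = 1024 * 1024**3
for label, size in (("4k", 4096), ("8k", 8192), ("128k", 131072)):
    print(label, block_count(volume, size))
```

For the 1024G volume this gives 268,435,456 blocks at 4k, 134,217,728 at 
8k and 8,388,608 at 128k, which is where the extra metadata overhead 
comes from.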

As a note, I am running a KVM host with ~10 VMs on a 4-disk, 2-way 
mirror + 2 SSDs for L2ARC and SLOG. I am using a regular ZFS dataset 
(i.e. no ZVOLs involved) with the standard 128k recordsize and LZ4 
compression. I 
am extremely happy with its performance so far, and the HDDs show quite 
low utilization despite the large recordsize.

Anyway, if rebuilding the machine with SSDs *only*, I think I would go 
for an 8k or 16k recordsize. 4k would be out of the question, however.

Sorry for the lengthy post...

Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti at assyoma.it - info at assyoma.it
GPG public key ID: FF5F32A8
