[zfs-discuss] ZFS zvols for KVM virtual machines storage
g.danti at assyoma.it
Thu Dec 28 06:16:30 EST 2017
On 28/12/2017 09:56, Gena Makhomed via zfs-discuss wrote:
> On 28.12.2017 8:51, Richard Elling via zfs-discuss wrote:
>> Finally, in general, it is a bad idea to use recordsize or
>> volblocksize = 4k, even if the physical_block_size = 4k.
>> The overhead can be worse than the read/modify/write penalty for
>> larger recordsize and nvme devices are rarely
>> bandwidth limited. YMMV, so test if you can.
> Why? Which overhead are you talking about?
> I use ZFS zvols for KVM storage on a server;
> the virtual machines use a 4k block size to read/write data.
> So I create all zvols for virtual machines with volblocksize = 4k:
> # zfs create -s -b 4K -V 1024G tank/vm-example-com
> With volblocksize = 128k there would be huge write amplification,
> slowing down the whole server.
> ZFS pool is mirror of two 4TB HDD:
> # zpool create -o ashift=12 -O compression=lz4 -O atime=off -O xattr=sa
> -O acltype=posixacl tank mirror
I second Richard's advice not to use a 4k volblocksize/recordsize. I've
benchmarked it on both HDD and SSD, and 4k performed significantly
slower than an 8k volume.
This was especially true for HDDs, whose performance at small (< 128KB)
transfers is dominated by seek time rather than throughput. As an
example, let's imagine a 1k write to a 4k volume; the write targets a
file whose metadata are cached in the ARC, while the raw data are not
found in the cache. A mechanical HDD (with a transfer rate of
~100 MB/s) would do the following:
- seek to find the required 4k block (12 ms);
- read the entire 4k block (0.04 ms);
- compute the new checksum;
- seek to the location where the new 4k block will be written (12 ms);
- write the new 4k block (0.04 ms).
As you can see, the raw block transfer time is vanishingly small
compared to the seek latency. At a 128k volblocksize, you are merely
changing the 2x 0.04 ms transfers into 2x ~1.3 ms ones, hardly a
game-changer.
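The arithmetic above can be sketched in a few lines. This is a
back-of-the-envelope model, not a benchmark; the 12 ms seek and
100 MB/s transfer rate are the assumed figures from the example:

```python
# Total service time for the read-modify-write of one block:
# seek + read the old block, then seek + write the new one.
SEEK_MS = 12.0
THROUGHPUT_BYTES_PER_MS = 100e6 / 1000  # ~100 MB/s, expressed per ms

def rmw_time_ms(block_bytes: int) -> float:
    """Seek + transfer, twice (once to read, once to write)."""
    transfer_ms = block_bytes / THROUGHPUT_BYTES_PER_MS
    return 2 * (SEEK_MS + transfer_ms)

for size_kib in (4, 128):
    print(f"{size_kib:>3}k block: {rmw_time_ms(size_kib * 1024):.2f} ms")
```

This prints ~24.08 ms for a 4k block and ~26.62 ms for a 128k one: the
32x larger block costs barely 10% more, because the seeks dominate.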
You can argue that a 4k volume will mostly receive aligned, full-block
4k writes, skipping the read-modify-write penalty completely - and for
some workloads, you are right. If you are *sure* your workload is a pure
4k one, go with a 4k volblocksize/recordsize. However, if your
workload is a mixed one (i.e., with many non-aligned, non-full-block
writes), I recommend against a 4k volume.
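The trade-off can be made concrete with a simplified model of write
amplification. The assumptions are mine: any write that does not cover
whole blocks forces a read-modify-write of every touched block (ZFS
always rewrites blocks in full), and offset alignment is ignored:

```python
# Bytes physically written per byte of logical write, under the
# simplifying assumption that partial-block writes rewrite every
# touched block in full.
def write_amplification(write_bytes: int, volblocksize: int) -> float:
    if write_bytes % volblocksize == 0:
        return 1.0  # aligned full-block write: no RMW needed
    blocks_touched = -(-write_bytes // volblocksize)  # ceiling division
    return blocks_touched * volblocksize / write_bytes

print(write_amplification(4096, 4096))    # pure 4k workload  -> 1.0
print(write_amplification(1024, 4096))    # partial write     -> 4.0
print(write_amplification(4096, 131072))  # 4k on 128k block  -> 32.0
```

So the pure-4k workload really is the one case where a 4k volume wins;
anything else pays for whichever block size mismatches it.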
After all, there is a reason why ZFS defaults to a rather large 128k
recordsize. ZVOLs defaulted to an 8k volblocksize because legacy
filesystems tended to write many small data blocks, and using a large
volblocksize for an iSCSI-exported ZVOL means much more wire traffic
than necessary. But *why* 8k? I suppose the default stems from 8k being
the standard, and smallest, page size on UltraSPARC CPUs.
For SSDs, which are generally not latency bound, things are somewhat
different. They respond much better to smaller recordsize/volblocksize,
and the lower write amplification enables them to last longer. For an
all-SSD volume/filesystem, I would probably choose 8k. On the other
hand, I would - again - avoid 4k, for the following reasons:
- performance is lower. I think this is related to 4k receiving much
less optimization in the original code, UltraSPARC really being targeted
at 8k memory pages. Moreover, at least in the old days, ZoL itself was
much slower with such a small block size;
- compression would be greatly reduced, if not almost non-functional. As
you should really use ashift=12 when dealing with SSDs, and this imposes
a 4k minimum allocation size, a 4k volblocksize/recordsize cannot be
compressed any further. The only exception is for 100% zeroed blocks,
which can be marked as "empty". Note: in this case, you should run with
ZLE rather than LZ4, as the former is faster and the latter does not buy
you anything more;
- more blocks means more memory for metadata tracking, and less memory
for ARC/L2ARC caching. This even affects the CPU power required to
"drive" the filesystem itself. In other words, a 4k volume incurs
much higher overhead than a larger (even 8k) one.
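The last two points can be put in rough numbers. These are my own
illustrative figures: with ashift=12, compressed records are allocated
in whole 4k sectors, so a 4k record cannot shrink at all; and smaller
blocks multiply the number of block pointers and checksums to track:

```python
# With ashift=12, on-disk allocation happens in 4 KiB sectors.
SECTOR = 1 << 12

def allocated_bytes(recordsize: int, compress_ratio: float) -> int:
    """On-disk allocation of one record after compression."""
    compressed = int(recordsize / compress_ratio)
    sectors = -(-compressed // SECTOR)  # round up to whole sectors
    return sectors * SECTOR

# 2:1-compressible data:
print(allocated_bytes(4 * 1024, 2.0))    # 4096  -> nothing saved
print(allocated_bytes(128 * 1024, 2.0))  # 65536 -> half the space saved

# Data blocks to track for a fully written 1 TiB volume:
for bs_kib in (4, 8, 128):
    print(f"{bs_kib:>3}k: {(1 << 40) // (bs_kib * 1024):,} blocks")
```

A 1 TiB volume at 4k means ~268 million blocks versus ~8 million at
128k - a 32x difference in metadata to keep in memory.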
As a note, I am running a KVM host with ~10 VMs on a 4-disk, 2-way
mirror pool + 2 SSDs for L2ARC and SLOG. I am using a regular ZFS
dataset (i.e., no ZVOLs involved) with the standard 128k recordsize and
LZ4 compression. I am extremely happy with its performance so far, and
the HDDs show quite low utilization despite the large recordsize.
Anyway, if rebuilding the machine with SSDs *only*, I think I would go
for an 8k or 16k recordsize. 4k would be out of the question, however.
Sorry for the lengthy post...
Assyoma S.r.l. - www.assyoma.it
email: g.danti at assyoma.it - info at assyoma.it
GPG public key ID: FF5F32A8