[zfs-discuss] ZFS zvols for KVM virtual machines storage

Gena Makhomed gmm at csdoc.com
Thu Dec 28 13:44:50 EST 2017

On 28.12.2017 19:21, Gionatan Danti wrote:

>> As far as I know, XFS by default always uses a block size of 4096 bytes
>> for I/O, so read-modify-write with XFS inside a VM will never occur.

> I stand partially corrected: recent xfs versions (at least what provided 
> with RHEL 7.4+) default to use 4k block size *with* read/modify/write 
> behavior for smaller-than-blocksize writes, unless you use O_DIRECT. I 
> just tested it writing 512 bytes to a 4k xfs filesystem: when not using 
> O_DIRECT, 4k reads/writes were issued to the disk.

Inside the virtual machines, only Percona Server generates I/O with O_DIRECT
(via innodb_flush_method = O_DIRECT), but Percona Server uses 16k pages.
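For reference, the relevant my.cnf fragment looks roughly like this (shown only for illustration; innodb_page_size already defaults to 16k):

```ini
[mysqld]
# InnoDB bypasses the guest page cache with O_DIRECT
innodb_flush_method = O_DIRECT
# default InnoDB page size (16 KiB); relevant when matching volblocksize
innodb_page_size = 16384
```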

> However, start using O_DIRECT and no more read/modify/write occurs on 
> the filesystem data portion; on the other side, journal metadata updates 
> seem to be *always* full 4k-aligned blocks. As you are using Qemu/KVM 
> with disabled host caching, which implies O_DIRECT, I suggest you 
> check your actual I/O behavior.

Yes, I disable host caching at the QEMU/KVM level;
this is recommended for Linux virtual machines.
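For illustration (the zvol path is hypothetical), the cache mode is set per disk; cache=none makes QEMU open the backing device with O_DIRECT, bypassing the host page cache:

```shell
# cache=none => host page cache bypassed (O_DIRECT on the zvol)
qemu-system-x86_64 \
    -drive file=/dev/zvol/tank/vm1-disk0,format=raw,if=virtio,cache=none
```

With libvirt the equivalent is cache='none' on the disk's driver element.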

Actually, for now I do not disable the ARC and keep primarycache=all for zvols,
and as I can see, the ARC is used and takes half of the server RAM:

ARC Size:                               99.74%  125.58  GiB
         Target Size: (Adaptive)         100.00% 125.90  GiB
         Min Size (Hard Limit):          6.25%   7.87    GiB
         Max Size (High Water):          16:1    125.90  GiB

ARC Total accesses:                                     110.71m
         Cache Hit Ratio:                23.53%  26.04m
         Cache Miss Ratio:               76.47%  84.66m
         Actual Hit Ratio:               21.46%  23.76m

         Data Demand Efficiency:         28.66%  63.52m
         Data Prefetch Efficiency:       66.15%  1.11m

But the cache miss ratio is high, so the ARC is not very efficient, in my opinion.
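As a quick sanity check on the numbers above (pure arithmetic, no ZFS involved; the report's 23.53% comes from the unrounded raw counts):

```shell
# hit ratio = cache hits / total accesses, from the arc_summary output above
awk 'BEGIN { printf "%.2f%%\n", 26.04 / 110.71 * 100 }'   # prints 23.52%
```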

> Anyway, the basic reasoning stands: 4k volblocksize are quite slow on 
> mechanical HDDs (unless your workload perfectly fits) and are 
> sub-optimal for modern SSDs also (due to flash page cache being 8/16K 
> nowadays).

IMHO a 4k volblocksize is good for zvols on HDDs to avoid write amplification:
writing one block is faster than the read-modify-write needed with volblocksize > 4k.
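A back-of-the-envelope sketch of the effect (pure arithmetic): a 4k guest write into a zvol with a 16k volblocksize forces ZFS to read the whole 16k block and rewrite it, instead of writing just 4k:

```shell
write_size=4096       # one guest filesystem block (4 KiB)
volblock=16384        # zvol volblocksize (16 KiB)
# read-modify-write moves a full block in and a full block out
rmw_bytes=$(( volblock * 2 ))
echo "I/O moved per 4k write: ${rmw_bytes} bytes ($(( rmw_bytes / write_size ))x amplification)"
```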

For Percona Server, a separate pool on SSD should be used, with its zvol
created with volblocksize=16k for maximum performance.
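Creating such a zvol would look like this (pool and volume names, and the size, are hypothetical):

```shell
# 16k volblocksize matches InnoDB's 16 KiB page size
zfs create -o volblocksize=16k -V 100G ssdpool/percona-data
```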

>> It is fast, but by default ZFS uses half of the server RAM for ARC.
>> And it will be double caching - inside VM and at ZFS layer.
>> Probably it is better to use ZFS ARC only for ZFS metadata
>> and use caching only inside VMs if server has low free memory.
> Sure, but it *should* deflate itself under memory pressure from the OS. 
> My plan B is to use primarycache=metadata and secondarycache=data, but 
> for now it is not needed.
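For reference, that plan B would be applied per dataset roughly like this (the dataset name is hypothetical; note that the secondarycache property accepts all|none|metadata, so "data" is approximated by all):

```shell
# keep only metadata in ARC; data blocks can then still land in L2ARC
zfs set primarycache=metadata tank/vm1-disk0
zfs set secondarycache=all tank/vm1-disk0
```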

I see very low L2ARC usage on servers with KVM in the output of the
"zpool iostat -v" command; see the alloc and free capacity of the cache device.

Probably the L2ARC is almost unused on your servers too.

I even set the following in /etc/modprobe.d/zfs.conf:

options zfs l2arc_noprefetch=0
options zfs l2arc_write_boost=838860800
options zfs l2arc_write_max=838860800

but this did not help; L2ARC usage was still very low.

So for all new servers I decided to use the SSDs for a dedicated
ZFS pool for Percona Server and not to use ZFS L2ARC at all.

Best regards,
