[zfs-discuss] ZFS zvols for KVM virtual machines storage

Gordan Bobic gordan.bobic at gmail.com
Thu Dec 28 12:50:46 EST 2017

FWIW, I have always used 4KB volblocksize with zvols backing my VMs with
reasonable results.
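For example (pool and zvol names are hypothetical), volblocksize has to be chosen at creation time and cannot be changed afterwards:

```shell
# Create a sparse 32G zvol with a 4k volblocksize (hypothetical names):
zfs create -s -V 32G -o volblocksize=4k tank/vm/disk0
# Verify the property took effect:
zfs get volblocksize tank/vm/disk0
```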

More inline below.

On 28 Dec 2017 17:21, "Gionatan Danti via zfs-discuss" <
zfs-discuss at list.zfsonlinux.org> wrote:

Il 28-12-2017 17:17 Gena Makhomed ha scritto:

> As far I know, XFS by default always use block size 4096 bytes
> for I/O, so read-modify-write with XFS inside VM will never occur.
I stand partially corrected: recent XFS versions (at least the one shipped
with RHEL 7.4+) default to a 4k block size *with* read-modify-write
behavior for smaller-than-blocksize writes, unless you use O_DIRECT. I just
tested it by writing 512 bytes to a 4k XFS filesystem: when not using
O_DIRECT, 4k reads/writes were issued to the disk.

However, start using O_DIRECT and no more read-modify-write occurs on the
filesystem data portion; on the other hand, journal metadata updates seem
to *always* be full 4k-aligned blocks. As you are using Qemu/KVM with
host caching disabled, which implies O_DIRECT, I suggest you check your
actual I/O behavior.
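The buffered-vs-O_DIRECT test above can be reproduced with dd; the mount point and device below are hypothetical, and the iostat step only shows where to look:

```shell
# Write 512 bytes into a file; with buffered I/O the kernel performs a
# 4k read-modify-write, while O_DIRECT sends the 512-byte write straight
# to the block layer (assuming the device accepts 512-byte requests).
f=$(mktemp)                                   # stand-in for a file on your XFS mount
dd if=/dev/zero of="$f" bs=512 count=1 conv=notrunc 2>/dev/null
stat -c %s "$f"                               # 512 bytes were written

# The O_DIRECT variant (run against a real XFS file, not tmpfs):
#   dd if=/dev/zero of=/mnt/xfs/test bs=512 count=1 conv=notrunc oflag=direct
# Watch the average request size while the writes run, e.g.:
#   iostat -x 1 /dev/sdX    # hypothetical device; check the request-size column
```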

Anyway, the basic reasoning stands: a 4k volblocksize is quite slow on
mechanical HDDs (unless your workload fits it perfectly) and is sub-optimal
for modern SSDs as well (flash page sizes being 8/16K nowadays).

None of which helps you keep things aligned and minimize write
amplification unless you can tell the FS on top of the zvol about it (for
ext*, see the -E stride/stripe_width options of mkfs).
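As a sketch of that alignment hint (the 16k volblocksize and the device path are hypothetical; stride is simply the volblocksize divided by the ext4 block size):

```shell
volblock=16384   # hypothetical zvol volblocksize in bytes
fsblock=4096     # ext4 block size in bytes
stride=$((volblock / fsblock))
# Tell ext4 about the zvol's block size so allocations stay aligned:
echo "mkfs.ext4 -b $fsblock -E stride=$stride,stripe_width=$stride /dev/zvol/tank/vm0"
```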

> NTFS use 4096 bytes clusters by default and read/write
> data to disk by 4k clusters, not by 512 bytes sectors.

Maybe you have a point; I've not checked how NTFS compares lately.

> It is fast, but by default ZFS use half of server RAM for ARC.
> And it will be double caching - inside VM and at ZFS layer.
> Probably it is better to use ZFS ARC only for ZFS metadata
> and use caching only inside VMs if server has low free memory.

Sure, but it *should* deflate itself under memory pressure from the OS. My
plan B is to use primarycache=metadata and secondarycache=data, but for now
it is not needed.

L2ARC gets filled with evicted pages from ARC, so when
primarycache=metadata, L2ARC cannot ever get anything but metadata in it.

IMO, you should always set primarycache=metadata for the zvol or fs backing
the VM images, and let the VM itself do its own caching. It will be much
faster and more efficient because the host OS cache is several context
switches further away than the page cache inside the VM.
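In property terms that advice would look like this (dataset name hypothetical; note that, per the above, primarycache=metadata also keeps data blocks out of any L2ARC):

```shell
# Let the guests cache their own data; keep only metadata in ARC:
zfs set primarycache=metadata tank/vm
# Verify the caching properties on the dataset:
zfs get primarycache,secondarycache tank/vm
```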