[zfs-discuss] going beyond 1MB with large_blocks

Phil Harman phil.harman at gmail.com
Sun Jan 17 14:24:53 EST 2016

In [zfs-announce] spl/zfs-0.6.5 released (Sep 11, 2015) we read …

> * large_blocks - This feature allows the record size on a dataset
> to be set larger than 128KB.  We currently support block sizes from 512
> bytes to 16MB.  The benefits of larger blocks, and thus larger IO, need
> to be weighed against the cost of COWing a giant block to modify one
> byte.  Additionally, very large blocks can have an impact on I/O
> latency, and also potentially on the memory allocator.  Therefore, we
> do not allow the record size to be set larger than zfs_max_recordsize
> (default 1MB).  Larger blocks can be created by changing this tuning,
> pools with larger blocks can always be imported and used, regardless of
> this setting.

I’ve been experimenting with small numbers of 12Gbps 15K SAS drives in various RAIDZ configurations with multiple sequential streams on OmniOS and Debian 7 and 8.

For example, having first increased zfs_max_recordsize to 16MB via mdb (OmniOS) or sysfs (Debian)...
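For anyone wanting to reproduce this, the tuning step itself looks something like the following (a sketch: the mdb write assumes a 32-bit int tunable, and the Linux path assumes the zfs module is already loaded):

```shell
# OmniOS/illumos: patch the live kernel variable with mdb (value in hex, 0x1000000 = 16MB)
echo "zfs_max_recordsize/W 0x1000000" | mdb -kw

# Debian/ZoL: write the module parameter via sysfs (value in bytes)
echo 16777216 > /sys/module/zfs/parameters/zfs_max_recordsize
```

Note that neither change persists across a reboot; on Linux you'd add a zfs module option for that.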

# zpool create tank raidz disk1 disk2 disk3 disk4 disk5
# zfs set recordsize=16m tank
# dd if=/dev/urandom bs=1024k count=1024 of=/tmp/file1G
# for i in 0 1 2 3; do dd if=/tmp/file1G bs=16384k of=/tank/file1G.$i & done; wait
# for i in 0 1 2 3; do dd of=/dev/null bs=16384k if=/tank/file1G.$i & done; wait

I’ve been using something like iostat -x 1 and DTrace to see what size I/Os are being issued to the drives. On OmniOS it’s 4MB (as expected), but on Debian 7 it’s 512KB, and on Debian 8 it’s 1MB.
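For the curious, the DTrace side of that is just an io provider one-liner (illumos; needs root, and aggregates the physical I/O sizes as a power-of-two histogram):

```shell
# Histogram of block I/O sizes issued to devices, printed on Ctrl-C
dtrace -n 'io:::start { @["I/O size (bytes)"] = quantize(args[0]->b_bcount); }'
```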

In https://github.com/zfsonlinux/zfs/blob/master/module/zfs/vdev_disk.c#L547 we see that the limiting factor is BIO_MAX_PAGES (which, being 256, when multiplied by the 4KB page size, yields 1MB).
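The arithmetic behind that ceiling, spelled out (assuming the usual 4KB x86-64 page size and the BIO_MAX_PAGES value of 256 found in kernels of this era):

```python
# A single struct bio can carry at most BIO_MAX_PAGES pages, so the
# largest I/O ZoL can submit in one bio is BIO_MAX_PAGES * PAGE_SIZE.
BIO_MAX_PAGES = 256   # include/linux/bio.h in contemporary kernels
PAGE_SIZE = 4096      # typical x86-64 page size

max_bio_bytes = BIO_MAX_PAGES * PAGE_SIZE
print(max_bio_bytes)                     # 1048576 bytes
print(max_bio_bytes // (1024 * 1024))    # 1 (MB) -- the cap observed on Debian 8
```

So a 16MB record ends up split into (at least) sixteen 1MB bios on Linux, whereas illumos manages 4MB per I/O.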

How do we get around this limit? Is it as simple as rebuilding the Linux kernel and everything else with a larger BIO_MAX_PAGES? Has anyone done this and lived to tell the tale?

Yes, I appreciate this will only make sense for niche workloads. I wasn’t really interested, either, until I started playing with 12Gbps SAS drives. However, I can only see it becoming more interesting over time.

