[zfs-discuss] going beyond 1MB with large_blocks

Andreas Dilger adilger at dilger.ca
Mon Jan 18 17:14:49 EST 2016

On Jan 17, 2016, at 12:24 PM, Phil Harman via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:
> In [zfs-announce] spl/zfs-0.6.5 released (Sep 11, 2015) we read …
>> * large_blocks - This feature allows the record size on a dataset
>> to be set larger than 128KB.  We currently support block sizes from 512
>> bytes to 16MB.  The benefits of larger blocks, and thus larger IO, need
>> to be weighed against the cost of COWing a giant block to modify one
>> byte.  Additionally, very large blocks can have an impact on I/O
>> latency, and also potentially on the memory allocator.  Therefore, we
>> do not allow the record size to be set larger than zfs_max_recordsize
>> (default 1MB).  Larger blocks can be created by changing this tuning,
>> pools with larger blocks can always be imported and used, regardless of
>> this setting.
> I’ve been experimenting with small numbers of 12Gbps 15K SAS drives in various RAIDZ configurations with multiple sequential streams on OmniOS and Debian 7 and 8.
> For example, having first increased zfs_max_recordsize to 16MB via mdb (OmniOS) or sysfs (Debian)...
> # zpool create tank raidz disk1 disk2 disk3 disk4 disk5
> # zfs set recordsize=16m tank
> # dd if=/dev/urandom bs=1024k count=1024 of=/tmp/file1G
> # for i in 0 1 2 3; do dd if=/tmp/file1G bs=16384k of=/tank/file1G.$i & done; wait
> # for i in 0 1 2 3; do dd of=/dev/null bs=16384k if=/tank/file1G.$i & done; wait

Note that there are potential issues lurking with blocksizes over 1MB, due to assumptions in the ZFS code about blocksize with respect to the number of blocks in a transaction, prefetch, allocation policies, elevator queue depth, etc.  We are currently investigating this for Lustre+ZFS, since it can benefit from larger blocksizes as well, but no patches are available yet.
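
For anyone wanting to reproduce the Linux side of the setup quoted above, here is a minimal sketch of raising the tunable through sysfs. It assumes the ZFS module exposes the parameter at /sys/module/zfs/parameters/zfs_max_recordsize and that it is run as root. The helper name mb_to_bytes and the APPLY switch are purely illustrative, not part of ZFS.

```shell
#!/bin/sh
# Sketch only: raise zfs_max_recordsize so that recordsize=16m is accepted.
# Assumption: the module parameter is exposed at this sysfs path.
PARAM=/sys/module/zfs/parameters/zfs_max_recordsize

# Hypothetical helper: convert a size in MB to bytes.
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

WANT=$(mb_to_bytes 16)

# Dry run by default; set APPLY=1 to actually write the new value.
if [ "${APPLY:-0}" = 1 ] && [ -w "$PARAM" ]; then
    echo "current: $(cat "$PARAM")"
    echo "$WANT" > "$PARAM"
    echo "new:     $(cat "$PARAM")"
else
    echo "dry run: would set zfs_max_recordsize=$WANT"
fi
```

Run it with APPLY=1 before "zfs set recordsize=16m tank"; without it, the script only prints the value it would write.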

> I’ve been using something like iostat -x 1 and DTrace to see what size I/Os are being issued. On OmniOS it’s 4MB (as expected), but on Debian 7 it’s 512KB and on Debian 8 it’s 1MB.
> In https://github.com/zfsonlinux/zfs/blob/master/module/zfs/vdev_disk.c#L547 we see that the limiting factor is BIO_MAX_PAGES (which, being 256, when multiplied by the 4KB page size, yields 1MB).
> How do we get around this limit? Is it as simple as rebuilding the Linux kernel and everything else with a larger BIO_MAX_PAGES? Has anyone done this and lived to tell the tale?

In theory, requests larger than BIO_MAX_PAGES can be built by chaining, but I'm not sure of the details of how this is done, since I see this same BIO_MAX_PAGES limit in the IO submission.
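
To put numbers on the limit discussed above, here is a rough sketch of the arithmetic, assuming 4KB pages and BIO_MAX_PAGES=256 as quoted. The function names are mine, for illustration only. It prints the largest payload of a single bio and the minimum number of such bios needed for the 4MB per-disk IO seen on OmniOS and for a full 16MB record.

```shell
#!/bin/sh
# Largest payload of a single bio, in bytes: pages per bio * page size.
bio_max_bytes() {
    echo $(( $1 * $2 ))
}

# Minimum number of bios needed to carry an IO of $1 bytes when each bio
# holds at most $2 bytes (ceiling division).
bios_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

BIO_MAX_PAGES=256   # value quoted from the kernel headers in this thread
PAGE_SIZE=4096      # assumed 4KB pages

LIMIT=$(bio_max_bytes "$BIO_MAX_PAGES" "$PAGE_SIZE")
echo "max bytes per bio:  $LIMIT"
echo "bios for a 4MB IO:  $(bios_needed $((4 * 1024 * 1024)) "$LIMIT")"
echo "bios for a 16MB IO: $(bios_needed $((16 * 1024 * 1024)) "$LIMIT")"
```

With these inputs the per-bio limit comes out to 1048576 bytes (1MB), which matches the IO size reported on Debian 8, so anything larger has to be split across several bios or chained.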

> Yes, I appreciate this will only make sense for niche workloads. I wasn’t really interested, either, until I started playing with 12Gbps SAS drives. However, I can only see it becoming more interesting over time.

Cheers, Andreas
