[zfs-discuss] going beyond 1MB with large_blocks

Phil Harman phil.harman at gmail.com
Tue Jan 19 18:44:10 EST 2016


> On 18 Jan 2016, at 22:14, Andreas Dilger <adilger at dilger.ca> wrote:
> 
> On Jan 17, 2016, at 12:24 PM, Phil Harman via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:
>> 
>> In [zfs-announce] spl/zfs-0.6.5 released (Sep 11, 2015) we read …
>> 
>>> * large_blocks - This feature allows the record size on a dataset
>>> to be set larger than 128KB.  We currently support block sizes from 512
>>> bytes to 16MB.  The benefits of larger blocks, and thus larger IO, need
>>> to be weighed against the cost of COWing a giant block to modify one
>>> byte.  Additionally, very large blocks can have an impact on I/O
>>> latency, and also potentially on the memory allocator.  Therefore, we
>>> do not allow the record size to be set larger than zfs_max_recordsize
>>> (default 1MB).  Larger blocks can be created by changing this tuning;
>>> pools with larger blocks can always be imported and used, regardless of
>>> this setting.
>>> 
>> 
>> I’ve been experimenting with small numbers of 12Gbps 15K SAS drives in various RAIDZ configurations with multiple sequential streams on OmniOS and Debian 7 and 8.
>> 
>> For example, having first increased zfs_max_recordsize to 16MB via mdb (OmniOS) or sysfs (Debian)...
>> 
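>> That step looked roughly like this (the mdb syntax is from memory, and I'm assuming the usual module parameter path on Debian):
>> 
>> # echo "zfs_max_recordsize/W 0t16777216" | mdb -kw                     (OmniOS)
>> # echo 16777216 > /sys/module/zfs/parameters/zfs_max_recordsize        (Debian)
>> 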
>> # zpool create tank raidz disk1 disk2 disk3 disk4 disk5
>> # zfs set recordsize=16m tank
>> # dd if=/dev/urandom bs=1024k count=1024 of=/tmp/file1G
>> # for i in 0 1 2 3; do dd if=/tmp/file1G bs=16384k of=/tank/file1G.$i & done; wait
>> # for i in 0 1 2 3; do dd of=/dev/null bs=16384k if=/tank/file1G.$i & done; wait
> 
> It should be noted that there are potential issues lurking with blocksize over 1MB, due to assumptions in the ZFS code about blocksize w.r.t. the number of blocks in a transaction, prefetch, allocation policies, elevator queue depth, etc.  This is something that we are currently investigating for Lustre+ZFS since it can benefit from larger blocksize as well, but no patches are available yet.

Andreas, thanks for the reply.

Yes, I’ve seen something of this, perhaps even with 1MB records. For example, I’ve seen cases where the volume of data read from disk is far more than expected to satisfy the demand (almost 8x in one scenario). I wondered if ZFS was prefetching too aggressively, perhaps such that prefetched data was being evicted from the ARC before it could be used. I’ve done some playing with zfetch_block_cap, zfetch_array_rd_sz and zfs_read_chunk_size, which seems to alleviate this, but it’s a work in progress.
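The knob-twiddling itself is nothing exotic; on the Linux side it's roughly the following (values purely illustrative, and assuming all three are exposed as module parameters on these builds; on OmniOS I poke the same variables with mdb -kw instead):

# echo 512 > /sys/module/zfs/parameters/zfetch_block_cap
# echo 16777216 > /sys/module/zfs/parameters/zfetch_array_rd_sz
# echo 16777216 > /sys/module/zfs/parameters/zfs_read_chunk_size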

>> I’ve been using something like iostat -x 1 and DTrace to see what size I/Os are being issued. On OmniOS it’s 4MB (as expected), but on Debian 7 it’s 512KB and on Debian 8 it’s 1MB.
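>> (The DTrace side is just a one-liner along these lines, giving a power-of-two distribution of physical I/O sizes as they're issued:)
>> 
>> # dtrace -n 'io:::start { @ = quantize(args[0]->b_bcount); }'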
>> 
>> In https://github.com/zfsonlinux/zfs/blob/master/module/zfs/vdev_disk.c#L547 we see that the limiting factor is BIO_MAX_PAGES (which, being 256, when multiplied by the 4KB page size, yields 1MB).
>> 
>> How do we get around this limit? Is it as simple as rebuilding the Linux kernel and everything else with a larger BIO_MAX_PAGES? Has anyone done this and lived to tell the tale?
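>> (For reference, the constant I mean is the #define BIO_MAX_PAGES 256 in the kernel's include/linux/bio.h, so I assume raising it means at least a kernel rebuild.)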
> 
> In theory there is request chaining beyond BIO_MAX_PAGES, but I'm not sure of the details of how this is done, since I see the same BIO_MAX_PAGES limit in the IO submission path.

I see this very occasionally (with blktrace) on Linux, although I have to increase /sys/block/xxx/queue/max_sectors_kb for it to happen at all. With OmniOS I get the expected large I/Os most of the time.
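To be concrete, the Linux-side checking looks something like this (4096 is just an example value; max_hw_sectors_kb is the hard ceiling the queue will accept):

# cat /sys/block/xxx/queue/max_hw_sectors_kb
# echo 4096 > /sys/block/xxx/queue/max_sectors_kb
# blktrace -d /dev/xxx -o - | blkparse -i -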

>> Yes, I appreciate this will only make sense for niche workloads. I wasn’t really interested, either, until I started playing with 12Gbps SAS drives. However, I can only see it becoming more interesting over time.
> 
> Cheers, Andreas

Cheers, Phil
