[zfs-discuss] ZFS eating up all 16 GB RAM when combining 4k blocksize ZVOL with XFS

Richard Elling richard.elling at richardelling.com
Wed Nov 21 21:01:36 EST 2018

> On Nov 12, 2018, at 2:08 PM, Omar Siam via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:
> Hi list!
> I am a bit frustrated because I am trying to debug a problem on a test RAIDZ array. I can reproducibly exhaust my 16 GB of physical memory and make the test machine hang, trigger the OOM killer, or panic the kernel on hung task or OOM if I configure it to do so.

By default, writes to zvols and filesystems are asynchronous. Pending writes are held in RAM
inside ZFS's memory space, nominally up to the ARC max size. This dirty data can compete for
memory with the applications writing to the zvol. The solution is either to commit the data
faster, which reduces the amount of dirty data held in ZFS, or to create back pressure on the
application so that it doesn't believe ZFS is infinitely fast. On the ZFS side, there are
tunables that apply. See
https://github.com/zfsonlinux/zfs/wiki/ZFS-Transaction-Delay
https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#write_throttle
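
To make the back pressure idea concrete, here is a rough sketch of the delay curve as I
understand it from the Transaction Delay page. The function name is made up for
illustration, and the default values (60 percent, 500000, 100 ms) are the module parameter
defaults as I recall them, so treat them as assumptions and check
/sys/module/zfs/parameters on your own system.

```python
def tx_delay_ns(dirty_bytes, dirty_data_max,
                delay_min_dirty_percent=60,   # assumed default of zfs_delay_min_dirty_percent
                delay_scale=500_000,          # assumed default of zfs_delay_scale
                delay_max_ns=100_000_000):    # assumed default of zfs_delay_max_ns (100 ms)
    """Approximate delay, in nanoseconds, applied to each new transaction.

    Illustrative model only, not the kernel code. Below the threshold there is
    no delay; above it the delay grows hyperbolically as dirty data approaches
    dirty_data_max, and it is capped at delay_max_ns.
    """
    threshold = dirty_data_max * delay_min_dirty_percent // 100
    if dirty_bytes <= threshold:
        return 0
    if dirty_bytes >= dirty_data_max:
        return delay_max_ns
    delay = delay_scale * (dirty_bytes - threshold) // (dirty_data_max - dirty_bytes)
    return min(delay, delay_max_ns)


if __name__ == "__main__":
    dirty_max = 1600 * 1024 * 1024  # hypothetical limit, roughly 10% of 16 GB
    for pct in (50, 70, 80, 90, 99):
        dirty = dirty_max * pct // 100
        print(f"{pct:3d}% dirty -> {tx_delay_ns(dirty, dirty_max)} ns per tx")
```

Lowering zfs_dirty_data_max or zfs_delay_min_dirty_percent moves the start of the curve
earlier, so the application gets slowed down before as much dirty data piles up in RAM.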

> Is this a bug for the issue tracker?
> I set up kdump crash dumping on oom and hung task and I got two dumps of the situation.
> My problem is: whatever eats up all RAM is not tracked in any statistics I know. htop does not show it and I also don't find any details in the kernel crash dump ps. It is however clear that all memory is used up in that situation by something.
> The setup may be somewhat special:
> * 3 Disk RAIDZ array, old drives 250 GB not the same brand or model
> * ZVOL with a block size of 4k (RAID-Storage/iSCSI1 volsize=100G, volblocksize= 4K, checksum=on, compression=lz4)

If the physical block size is 4k (unlikely, given the old 250 GB disks), then volblocksize=4k
is too small. In general, performance suffers when volblocksize is too small or too large.
Sometimes the penalty of read-modify-write (RMW) that comes with a larger volblocksize is worth
paying for better performance and space efficiency.
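
For the space efficiency part, a small sketch of how RAIDZ allocation rounds up small
blocks. This follows the allocation arithmetic as I understand it (data sectors, plus parity
per stripe row, rounded up to a multiple of nparity + 1). It ignores compression, and the
function name is made up for illustration.

```python
def raidz_allocated_bytes(psize, ndisks, nparity, ashift):
    """Bytes allocated on a RAIDZ vdev for one block of psize bytes.

    Illustrative model of the RAIDZ allocation arithmetic, not the kernel code.
    """
    sector = 1 << ashift
    data_sectors = (psize + sector - 1) // sector          # round up to whole sectors
    data_disks = ndisks - nparity
    rows = (data_sectors + data_disks - 1) // data_disks   # stripe rows needed
    total = data_sectors + nparity * rows                  # one parity sector per row per parity level
    multiple = nparity + 1
    total = (total + multiple - 1) // multiple * multiple  # pad to a multiple of nparity + 1
    return total * sector


if __name__ == "__main__":
    # 3-disk raidz1, as in the setup described above
    for ashift in (9, 12):
        for volblocksize in (4096, 8192, 16384):
            alloc = raidz_allocated_bytes(volblocksize, ndisks=3, nparity=1, ashift=ashift)
            print(f"ashift={ashift} volblocksize={volblocksize}: "
                  f"allocates {alloc} bytes ({100 * volblocksize / alloc:.1f}% efficient)")
```

With 4k sectors (ashift=12) a 4k block allocates 8192 bytes on this layout, which is no
better than a mirror. With 512-byte sectors (ashift=9) the same block allocates 6144 bytes.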

> * I formatted the zvol with XFS which I forced to use 4k blocks (mkfs -t xfs -d su=4k -d sw=3 -f /dev/RAID-Storage/iSCSI1)
> * Ubuntu 18.04.1, stock kernel 4.15.0-38-generic with their shipped zfs 0.7.5-1ubuntu1 (I also tried 4.18 series kernels with zfs/spl dkms 0.7.11)
> * a cache and log nvme SSD but it does not seem to matter much regarding the crash

Good observation; they neither help nor hinder this condition.

> * I created the array in freenas but as far as I know that should make no difference even if I now use it in ZFS on Linux
> * I use phoronix-test-suite to run some disk speed test utilities(1808205-RA-SAMSUNGQU86[1], it runs the same utility and configuration several times)
> ** it runs some sqlite injections (3 or 4 times)
> ** fio-3.1 with "block sizes" of 4k and 2M random reading using libaio (Linux AIO) (about 10 times both)
> ** fio-3.1 with "block size" of 2M random writing using libaio (Linux AIO) [2]
> The first run of this last test succeeds. On the second run suddenly the RAM usage spikes from around 4 GB to all of the RAM which crashes the machine or almost all which causes hung task kernel messages and the machine is unresponsive. [3]

Buffering at the app and page cache layers isn't always easy to see, but if you watch the 
kernel memory usage using free, or better yet, telegraf+influxdb or prometheus, then you
should see the breakdown across free, available, and buffer cache. I would expect buffer
cache to be enlightening.
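
If you don't have a metrics stack handy, a minimal sketch along these lines can be run in a
loop while the test is going. It parses the text of /proc/meminfo and reports a few fields
plus whatever is left over. The choice of fields and the "remainder" figure are my own rough
heuristic, not an exact accounting, and the function names are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value.

    Most values are in kB; a few fields (such as HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def remainder_kib(fields):
    """MemTotal minus a handful of well-known consumers (rough heuristic only)."""
    known = ("MemFree", "Buffers", "Cached", "Slab",
             "AnonPages", "PageTables", "KernelStack")
    return fields["MemTotal"] - sum(fields.get(name, 0) for name in known)


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    for name in ("MemTotal", "MemFree", "MemAvailable", "Buffers",
                 "Cached", "Dirty", "Writeback", "Slab"):
        print(f"{name:>12}: {info.get(name, 0) / 1024:10.1f} MiB")
    print(f"{'remainder':>12}: {remainder_kib(info) / 1024:10.1f} MiB")
```

A large and growing remainder means the memory is going somewhere these counters do not
cover, which is worth comparing against the ARC size reported by ZFS.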

> The ARC statistics show only about 4 GB of RAM use at that moment.

Yes, this is what ZFS is using.
 -- richard

> I am willing to provide any information I can get from the machine to help track that down but I have no more ideas where to look.
> My conclusion is that there is something wrong with ZFS's zvols and Linux's kernel-based asynchronous I/O.
> As a side note: I also tried to run the same test in a Linux bhyve VM on FreeNAS 11.2 RC using that same physical array, the same ZVOL and the FreeBSD ZFS implementation, and that works. I can run the write test and all the other following tests.
> Furthermore, this started as a test for iSCSI on RAIDZ. With Linux's built-in LIO/iSCSI implementation the test also finishes on the machine using this setup as the target. Using the alternative SCST implementation, the random write test crashes the target machine. Perhaps the former does not use async I/O and the latter does.
> Best regards
> Omar Siam
> [1] https://openbenchmarking.org/result/1808205-RA-SAMSUNGQU86
> [2] Generated fio-3.1 configuration:
> [global]
> rw=randwrite
> ioengine=libaio
> iodepth=64
> size=1g
> direct=1
> buffered=0
> startdelay=5
> ramp_time=5
> runtime=20
> time_based
> disk_util=0
> clat_percentiles=0
> disable_lat=1
> disable_clat=1
> disable_slat=1
> filename=fiofile
> [test]
> name=test
> bs=2m
> stonewall
> [3]
> crash> kmem -i
>                  PAGES        TOTAL      PERCENTAGE
>     TOTAL MEM  3990273      15.2 GB         ----
>          FREE    34063     133.1 MB    0% of TOTAL MEM
>          USED  3956210      15.1 GB   99% of TOTAL MEM
>        SHARED    65844     257.2 MB    1% of TOTAL MEM
>       BUFFERS        0            0    0% of TOTAL MEM
>        CACHED    64807     253.2 MB    1% of TOTAL MEM
>          SLAB   875668       3.3 GB   21% of TOTAL MEM
>    PID    PPID  CPU       TASK        ST  %MEM     VSZ RSS  COMM
>    5379   4951   3  ffff9844f780ae80  UN   0.0  511368    772 fio
