[zfs-discuss] High Resource Usage During Copy

Brian Behlendorf behlendorf1 at llnl.gov
Mon Jun 13 13:52:25 EDT 2011


By default ZFS will set the maximum ARC size to 3/4 of memory or all but
1GB of memory, whichever is larger.  This value can be explicitly tuned
by setting the zfs_arc_max module option if needed.  You may want to set
this value to, say, 50% (or less) of memory on a desktop system, which
tends to have very unpredictable memory usage.
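
For example, a line like the following caps the ARC at 2GB (the 2GB
value is only a placeholder, pick whatever fits your system):

  # /etc/modprobe.d/zfs.conf -- applied when the zfs module loads
  options zfs zfs_arc_max=2147483648

On versions where the parameter is writable at runtime you may also be
able to change it on the fly:

  echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max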

In general you'll want the ARC to be as large as possible.  It's
analogous to the Linux page cache in many ways.  The ARC serves as both
a write-back cache and a prefetch cache for ZFS.  All of the dirty data
in the ARC will be regularly synced to disk every few seconds.  To keep
these sync times under control the ARC will begin throttling I/O when
the cache is half full of dirty data.  The rest of the ARC is a read
cache and its contents can be quickly discarded if the memory is needed
by the system.
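
You can actually watch these periodic syncs go by as bursts of writes
every few seconds with nothing fancier than zpool iostat (replace
'tank' with your pool name):

  zpool iostat tank 1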

The rub is that all of this machinery was designed to work well for the
Solaris kernel.  Unfortunately, the Solaris kernel and Linux kernel have
quite different memory management models.  So while things are basically
working, there's still a fair bit of work to integrate the ARC more
seamlessly with Linux.  For this reason, you may notice increased CPU
usage by processes like kswapd or arc_reclaim.  I'm confident this can
be improved in the long term, but it requires adapting the ARC to be
more Linux friendly.

As for using the standard 'echo 3 > /proc/sys/vm/drop_caches' trick to
clear the ARC: it does work for clean data.  Dirty data will be written
out after a few seconds as part of the sync.  You can monitor the ARC
usage on your system by watching the /proc/spl/kstat/zfs/arcstats file.
Alternatively, Gunnar Beutner recently wrote a linux-kstat module which
allows the use of the arcstat.pl Perl script on Linux.  This is a fairly
common Solaris tool for monitoring the ARC.  I personally find it
useful.

https://github.com/behlendorf/linux-kstat
https://github.com/behlendorf/arcstat
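
If you just want a quick look without installing anything, the raw
kstat file is easy to pick apart with awk.  The field names below are
the common ones but may differ slightly between releases:

  # current ARC size, target size, and maximum size in bytes
  awk '$1 == "size" || $1 == "c" || $1 == "c_max" { print $1, $3 }' \
      /proc/spl/kstat/zfs/arcstats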

You may also notice one strange behavior of the ARC under Linux.  If the
ARC grows too large you may see it suddenly shrink by a large fraction
of its size.  This is caused by the Linux VM too aggressively reclaiming
pages from the ARC.  This behavior is suboptimal since it discards a lot
of useful cached data, but it is harmless since the discarded data was
just read cache.  This is a known issue which still needs to be
addressed.

Additionally, you can rest assured all dirty data in the ARC is properly
flushed to disk when the file system is unmounted.  It will also be
synced to disk every few seconds so dirty data doesn't linger in the
cache.  However, the total size of the ARC may not decrease; the dirty
data has simply been marked clean and is safe on disk.  This is
fundamentally how most file systems work.
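
A quick way to convince yourself of this (the path and size below are
only an example) is to write some data, force a sync, and then check
the ARC size:

  # write 512MB into a ZFS dataset, then force the dirty data out
  dd if=/dev/zero of=/tank/testfile bs=1M count=512
  sync
  # the reported size generally won't drop even though the data
  # is now safely on disk
  awk '$1 == "size" { print $3 }' /proc/spl/kstat/zfs/arcstats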

Finally, I should at least mention the issue of memory fragmentation.
Internally, the ARC makes heavy use of a data structure called a slab.
Simply put, a slab is a large chunk of memory which is preallocated and
then chopped up into smaller, equally sized chunks.  This has certain
advantages and disadvantages.  One advantage is that it significantly
speeds up memory allocation time for allocations of a common known size.
This is particularly true on Linux for allocations which use the virtual
address space.

One disadvantage is that a slab may become fragmented.  This means that
when a full slab of memory is allocated, it's possible that only a small
fraction of it is actually used.  This effect can artificially inflate
the memory usage.  This is particularly true for the ARC because the
arcstats do not track this inflation factor.  This is one of the reasons
people on the list may suggest setting zfs_arc_max to a smaller value
than the default.

If you're curious about the current ZFS slab fragmentation on your
system you can check the /proc/spl/kmem/slab file.  Each row in this
file corresponds to a different slab cache.  The size column shows how
much memory is currently set aside for that slab cache, and the alloc
column shows how much of that memory is actually in use.
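
A rough utilization number per cache can be pulled out with awk.  Note
that the exact column layout has varied between SPL releases, so double
check the header; the snippet below assumes size is the third field and
alloc the fourth:

  awk 'NR > 2 && $3 > 0 { printf "%-30s %5.1f%%\n", $1, 100 * $4 / $3 }' \
      /proc/spl/kmem/slab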

I hope this helps clear up some of the concern/confusion surrounding the
ARC.  As I said, longer term the plan is to more closely integrate the
ARC with the Linux page cache.  As that happens things like the
fragmentation issue will improve and the other small quirks should
disappear.  It's just going to take some additional time and effort to
get there.

-- 
Thanks,
Brian 

On Sun, 2011-06-12 at 07:34 -0700, Robin Humble wrote:
> On Sun, Jun 12, 2011 at 01:29:20PM +0000, Steve Costaras wrote:
> >As for memory usage, yes that's normal for ZFS; to avoid fragmentation it will buffer all writes and flush to disk when memory pressure is great enough or when a file is closed. This can also skew performance benchmarks when you have a lot of memory.
> 
> speaking of benchmarks - I'm trying to run tests with Lustre on ZFS,
> and I'm seeing higher than expected read rates in some cases.
> 
> is there a way to drop the ZFS ARC to make sure I'm reading from disk
> instead of ram?
> 
> the benchmark tests I'm running are small and fast, unlike real
> production loads. Lustre also does its own read and write-through
> caching, so I want to eliminate the ZFS ram variable.
> 
> I'm doing a Linux standard 'echo 3 > /proc/sys/vm/drop_caches'
> everywhere before reads, but I don't know if zfs and/or spl are hanging
> onto ram caches under the hood... ??
> 
> cheers,
> robin


