[zfs-discuss] rsync fileserver - how to push metadata to L2ARC and make it stick there?

devzero at web.de devzero at web.de
Wed Nov 14 16:55:06 EST 2018

good evening,

to catch up with this old thread, i'd like to ask if there is any news on it (maybe i missed some discussion)...
ssd has gotten so damn cheap now (128gb for 25 euro here) and i'd still love to be able to cache the backup server's metadata on ssd instead of buying more ram to scale up. i can't even scale to more than 32gb (which is not enough for all the metadata), as it's an older machine and even used ram is expensive for it...

meanwhile i came across this blogpost:
i know it's not the "typical" workload to rsync millions of files, but you may understand that having a 2-stage caching mechanism with a special tuning knob for metadata-centric workloads, which then gives no "real world" benefit, is a little bit frustrating - especially without any perspective for a real solution besides "buying ram".
I would like to do some real-world testing on this, so could somebody perhaps help with porting ben rubson's original patch for openzfs to recent zfsonlinux? shouldn't be too hard i think.
( https://www.illumos.org/issues/7361 )

i'm no developer - for example i have problems with this hunk:

-			if ((write_asize + HDR_GET_LSIZE(hdr)) > target_sz) {
+			if ((write_asize + arc_hdr_size(hdr)) > target_sz) {

that hunk no longer applies, as in the current zfsonlinux source the line now looks like this:

			if ((write_asize + asize) > target_sz) {


Sent: Sunday, 27 August 2017 at 23:21
From: "Richard Elling" <richard.elling at richardelling.com>
To: "Gordan Bobic" <gordan.bobic at gmail.com>
Cc: "General discussion - ask questions, receive answers and advice from other ZFS users" <zfs-discuss at list.zfsonlinux.org>, devzero at web.de
Subject: Re: [zfs-discuss] rsync fileserver - how to push metadata to L2ARC and make it stick there?


On Aug 25, 2017, at 11:07 PM, Gordan Bobic <gordan.bobic at gmail.com> wrote:

What kernel module options can be used to make L2ARC feeding more aggressive to reduce the probability that data will be evicted from ARC without having been fed into L2ARC?
There are two main tunables:
l2arc_write_max (default = 8 MiB) limits the number of bytes the feed thread attempts to write each time it wakes up

l2arc_feed_secs (default = 1) is the number of seconds the feed thread waits between feeds; it is unlikely you'll want to change this
Thus the default nominal feed rate is 8 MiB/sec. Why so low? Because when that tunable was defined, 10+ years ago, many people were very worried about flash write endurance -- and for good reason, the drives were awful by today's standards. 
It is worth noting that l2arc_write_max is per-system, not per-l2arc device and not per-pool. In the bad old days people could barely afford very many SSDs so there might only be 1 or 2 per system. In a previous life we changed this, but IMHO it isn't worth the effort given the cost of modern SSDs and DRAM. It is simply easier to tune than develop and maintain some algorithm that tries to make this smarter. In other words, if you have the money for a second l2arc SSD, you'll be happier buying more DRAM or a bigger SSD.
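To put the default rate in perspective, here is a back-of-envelope calculation (the 100 GiB cache device size below is just an illustrative figure, not from the thread):

```shell
# Back-of-envelope: time to fill an L2ARC at the default nominal feed rate.
# Defaults: l2arc_write_max = 8 MiB per feed, l2arc_feed_secs = 1 second.
write_max_mib=8
feed_secs=1
l2arc_gib=100                                  # hypothetical cache device size
rate_mib_s=$((write_max_mib / feed_secs))      # nominal MiB/s
seconds=$((l2arc_gib * 1024 / rate_mib_s))
echo "~$((seconds / 3600)) hours to write ${l2arc_gib} GiB at ${rate_mib_s} MiB/s"
```

So even ignoring re-writes of churned data, warming a modest cache device at the defaults takes hours, which is why l2arc_write_max is the knob people reach for.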
Next we have the often misunderstood settings that matter much less than l2arc_write_max (these are unlikely to need tuning):
l2arc_write_boost (default = 8 MiB) which is added to l2arc_write_max up until the point where the ARC is warm (fills to the point where it begins to evict). There is no real need to change this if you make l2arc_write_max bigger because...
l2arc_feed_again (default = 1) will try to run the feed thread again if the system is writing a lot. l2arc_feed_min_ms (default = 200 ms) is a damper that prevents this from turning into an infinite loop. So you can see maybe 3-4 feeds per second on a system that is writing a lot. Recall that 4 * 8 MiB is only 32 MiB, so l2arc_write_max is the big knob.
l2arc_norw (default = 0, but was 1 in many distributions for many years) is an anachronistic workaround for the problem of awful SSDs that didn't do well when given concurrent read and write commands. Don't change this; we really should deprecate it.
l2arc_headroom_boost (default = 200(%)) is used when compressed l2arc is enabled and allows the scanner to look farther into the ARC lists. It is a multiplier for l2arc_write_max (ultimately).
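On zfsonlinux these are module parameters; a sketch of inspecting and changing them at runtime (the 64 MiB value is only an example, and the sysfs path assumes the zfs module is loaded):

```shell
# Sketch: inspect the L2ARC tunables discussed above on a zfsonlinux box.
# Assumes the zfs module exposes them under /sys/module/zfs/parameters.
P=/sys/module/zfs/parameters
for t in l2arc_write_max l2arc_write_boost l2arc_feed_secs l2arc_feed_min_ms \
         l2arc_feed_again l2arc_norw l2arc_headroom_boost; do
    if [ -r "$P/$t" ]; then
        printf '%s=%s\n' "$t" "$(cat "$P/$t")"
    fi
done
# Raise the per-wakeup write budget to 64 MiB (needs root, applies at once):
# echo $((64 * 1024 * 1024)) | sudo tee $P/l2arc_write_max
```

To make a value survive a reboot, the usual route is an `options zfs l2arc_write_max=...` line in /etc/modprobe.d/zfs.conf.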
In looking at this again (I've been in the all-SSD world for a while) it occurs to me that arcstat doesn't show the important metrics for determining l2arc use and sizing. The reason I didn't notice is that I use telegraf + influxdb + grafana to capture and plot system metrics -- which means I typically don't use CLI tools like arcstat. I'll add to my todo list an explanation of how to determine whether l2arc is helpful for you.
 -- richard

On 26 Aug 2017 01:50, "Richard Elling" <richard.elling at richardelling.com> wrote:

On Aug 25, 2017, at 6:24 AM, Gordan Bobic via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:

Are your cache devices showing up in the output of zpool status?
Anything that gets pushed out of the ARC (primarycache) will end up in L2ARC (secondarycache) if both are set to "all".
This is not correct. L2ARC is filled optimistically by the l2arc_feed_thread. It wakes up periodically and looks for
data to feed into L2ARC. Some data is eligible, partially based on the secondarycache property, and any ineligible 
data is not sent.

You cannot force the data into L2ARC; it is filled by data evicted from ARC, but it won't expire from L2ARC until the L2ARC overflows. If you want the L2ARC to cache only metadata, set secondarycache=metadata on the pool.
Again, this is not correct. Data evicted from the ARC is a completely different decision than data fed into the L2ARC.
It is quite possible for the data churn rate to be faster than the l2arc_feed_thread, especially on low-memory machines.
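For reference, restricting L2ARC eligibility via the secondarycache property mentioned above looks like this ("tank" is a placeholder pool name):

```shell
# Limit what is eligible for L2ARC to metadata only; "tank" is hypothetical.
# secondarycache accepts all | none | metadata and is inherited by children.
zfs set secondarycache=metadata tank
zfs get -r secondarycache tank    # verify the setting and its inheritance
```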


On 25 Aug 2017 15:11, "Roland via zfs-discuss" <zfs-discuss at list.zfsonlinux.org> wrote:

Hello,

i'm running a large zfsonlinux (0.7.1-1 on centos 7.3) fileserver for rsync backup, containing about 20,000,000 files.

As the backups are mostly incremental and have very small changes, most of the runtime is spent crawling the directory trees, stat()'ing all files to check whether anything has changed on the server being backed up.

That's a lot of IOPs going to the (slower) raidz sata disks...

As i have found that the metadata size in ARC is considerable (about 2GB for 1 million files, which may sum up to 40GB for 20 million files), i'd like that metadata to be kept in L2ARC.
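The estimate above is just linear extrapolation; spelled out:

```shell
# Linear extrapolation of the ARC metadata footprint estimated above:
# ~2 GB per 1 million files  =>  ~40 GB for 20 million files.
gb_per_million=2
millions_of_files=20
echo "estimated ARC metadata: $((gb_per_million * millions_of_files)) GB"
```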

To speed things up, how can i force the metadata blocks to be pushed to l2arc and make them mostly stick there?
In these cases, the metadata should be MFU and the data MRU. So the normal MFU/MRU balancing, along with
an increased metadata limit, will be a good thing.
 — richard
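On zfsonlinux, the "increased metadata limit" Richard mentions corresponds to the zfs_arc_meta_limit module parameter; a sketch (the 24 GiB value is only an illustrative figure for a 32 GiB machine):

```shell
# Sketch: check and raise the ARC metadata cap on zfsonlinux.
# zfs_arc_meta_limit is in bytes; 0 means "use the built-in default",
# which is a fraction of the maximum ARC size.
P=/sys/module/zfs/parameters
if [ -r "$P/zfs_arc_meta_limit" ]; then
    cat "$P/zfs_arc_meta_limit"
fi
# Allow metadata to use up to 24 GiB of ARC (needs root; example value):
# echo $((24 * 1024 * 1024 * 1024)) | sudo tee $P/zfs_arc_meta_limit
```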

I tried all different kinds of warmup (rsync --dry-run, repeated recursive stat/getfacl/getattr, repeated reads within milliseconds, as i read somewhere that re-reads need to happen within a timeframe of 62.5ms), and set primarycache=all|metadata, but it seems the metadata always expires sooner or later when other metadata is pushed into the ARC, filling it to the limit.

Even if not a single file is changed on disk, as soon as i "warm up" the ARC with directory "tree2", reading the metadata of directory "tree1" no longer gets served completely from ARC or L2ARC. I see lots of IOPs going to the physical disks instead.

From my understanding, L2ARC is an extension of the ARC - so why doesn't it work as expected?
Is this a bug or a feature, or is that workload just too exotic (i don't think so)?

zfs-discuss mailing list
zfs-discuss at list.zfsonlinux.org

More information about the zfs-discuss mailing list