[zfs-discuss] cache and log drive failure impact on running system.

Uncle Stoatwarbler stoatwblr at gmail.com
Fri Dec 18 15:24:15 EST 2015

On 18/12/15 17:58, Muhammad Yousuf Khan via zfs-discuss wrote:
> Thanks for your input guys, my question is still their. can you guyz
> please share if any of them or both SSDs goes down

Assuming you mean the SLOG and L2ARC:

1: Loss of either device during normal operation is a non-event - it
simply gets kicked out of the pool and the system continues, possibly
running a little slower. Depending on which device fell over, either
synchronous (ZIL) writes go to the main pool drives instead of the SLOG
device, or the L2ARC cache is no longer there.

2: Power loss or other catastrophic failure will cause _some_ data loss
- namely the non-synchronous cached writes that were pending at the time
of the crash/outage. Because ZFS is a transactional filesystem (ie,
writes either complete and mark themselves as such, or the FS rolls back
to the previous transaction), there will never be filesystem corruption,
and the effect of a catastrophic system failure is "Loss of data written
in the last few seconds (normally 5s maximum) before the power went off".

2a: Synchronous writes are written to the ZIL (or SLOG drive if it
exists) and in the event of a power failure before they got committed to
the main pool, will be replayed at startup.

3: Loss of l2arc at powerup after a power outage is a non-event. The
system will simply carry on with less cache space than it had (l2arc is
repopulated from zero at each restart)

4: Follows from 2a. If the SLOG device fails to come back when the
system is restarted after a catastrophic failure, there is nothing to
replay and you revert to section 2 behaviour - ie, some data loss: the
synchronous writes from the last few seconds before the power went off
are gone as well.

Given that you had a catastrophic failure, there should be a recovery
procedure and rolling back a few seconds before the outage isn't
normally a major problem in such a case.
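
To make cases 2, 2a and 4 concrete, here is a toy model in Python. It is
only a sketch of the behaviour described above, not how ZFS is
implemented: the class and method names are made up for illustration,
and the whole intent log is treated as a simple list.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPool:
    """Toy model of the crash cases above; hypothetical, not real ZFS code."""
    has_slog: bool = False
    committed: list = field(default_factory=list)   # data safely in the main pool
    pending: list = field(default_factory=list)     # writes in RAM awaiting a commit
    intent_log: list = field(default_factory=list)  # sync writes logged, not yet committed

    def write(self, data, sync=False):
        self.pending.append(data)
        if sync:
            # A sync write goes to the intent log before it is acknowledged.
            self.intent_log.append(data)

    def commit(self):
        # Periodic transaction commit: pending data lands in the pool
        # and the intent log is wiped.
        self.committed.extend(self.pending)
        self.pending.clear()
        self.intent_log.clear()

    def crash_and_restart(self, slog_survived=True):
        # Whatever was only in RAM is gone after a power loss (case 2).
        self.pending.clear()
        if self.has_slog and not slog_survived:
            # Case 4: the log lived on the SLOG and the SLOG did not come back.
            self.intent_log.clear()
        # Case 2a: replay whatever is left in the intent log.
        self.committed.extend(self.intent_log)
        self.intent_log.clear()
        return list(self.committed)


pool = ToyPool(has_slog=True)
pool.write("a")
pool.commit()
pool.write("b")             # non-synchronous, still in RAM
pool.write("c", sync=True)  # synchronous, also in the intent log
print(pool.crash_and_restart(slog_survived=True))  # ['a', 'c']
```

With slog_survived=False the same sequence leaves only 'a', which is the
case 4 outcome. In neither case is the pool itself damaged.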

> does it corrupt the data

There is a fundamental difference between _filesystem_ integrity in the
face of catastrophic failure and _data_ integrity in the same event.

That being: Loss of the filesystem is a major issue. You've lost
everything and it's time to start hauling out backup tapes.

Loss of a few seconds' data may or may not be a big deal. If that was a
high-frequency-trading database write then it could be extremely
serious, but if it was a failed file write it's likely recoverable by
rerunning the last commands before the power went off.

Your choices regarding data integrity are (in order of increasing cost):
allow non-synchronous write caching; turn all non-synchronous writes
into synchronous writes (experience here shows this WILL slow down
writing dramatically, even if you use RAM-based SSDs); and, if you're
paranoid, double up on your SLOG device (ie, raid1).

This is akin to data=[writeback/ordered/journal] on an ext4 filesystem
and the way such a FS recovers from a system crash.
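
For reference, the three choices map roughly onto the "sync" dataset
property and a mirrored log vdev. The Python sketch below only builds
the command lines as argument lists, it does not run anything. The
helper names, the pool/dataset name "tank/db" and the device paths are
placeholders I made up; check zfs(8) and zpool(8) on your own system
before using anything like this.

```python
def sync_policy_cmd(dataset, policy):
    """Build (not run) the zfs command that sets the sync property.

    "standard" leaves non-synchronous write caching on (the default),
    "always" turns every write into a synchronous write.
    """
    if policy not in ("standard", "always", "disabled"):
        raise ValueError(f"unknown sync policy: {policy}")
    return ["zfs", "set", f"sync={policy}", dataset]


def add_slog_cmd(pool, devices):
    """Build (not run) the zpool command that adds a log vdev.

    Two or more devices are added as a mirror (the "raid1" option).
    """
    devices = list(devices)
    if not devices:
        raise ValueError("need at least one log device")
    vdev = ["mirror", *devices] if len(devices) > 1 else devices
    return ["zpool", "add", pool, "log", *vdev]


# Placeholder names, for illustration only.
print(" ".join(sync_policy_cmd("tank/db", "always")))
print(" ".join(add_slog_cmd("tank", ["/dev/disk/by-id/ssd-A", "/dev/disk/by-id/ssd-B"])))
```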

Note that most appliance vendors will push extremely expensive RAM-based
ZIL devices(*) at customers and then double them up, when the underlying
SATA/SAS structure would have trouble keeping up with a 64GB MLC or SLC
SSD.

(*) Zeusram devices are about $2800 apiece. Putting two in becomes a
_large_ chunk of the device cost, money that could easily be better
spent on adding more RAM when a Kingston 64GB ZIL-dedicated SSD is less
than 1/10 that price and works just as well for most purposes. Bear in
mind that outside of heavy duty enterprise database operations, sync
writes tend to account for less than 0.1% of disk write activity (and in
such a critical operation you'd expect databases to be hot-replicating
anyway).

> and what people may do in case both or single drive goes down and
> need to bring the system to normal condition.

"replace the drive that failed"

A ZIL/SLOG drive is an _intent_ log. You write your intention to the ZIL
(or SLOG if that's installed) immediately, then write out the data to
the pool later, then wipe the intent log when the pool commit is
complete. Unless the power goes off, the ZIL is a write-only circular
buffer. Nothing is ever there that isn't also in the system RAM awaiting
a commit to disk.

If the SLOG drive dies during normal operation, the system reverts to
writing ZIL data to a chunk of the main pool. (An SSD SLOG is there to
speed things up. It's not a critical device.)

L2arc is a read cache. Losing it shrinks your cache. That's all.
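
As a rough sketch of what "replace the drive that failed" looks like in
practice: a log device can normally be swapped with "zpool replace",
while a cache device is, as far as I know, removed and a new one added
instead. The Python below only builds those command lines. The function
name, the pool name and the device paths are placeholders, so check
"zpool status" and zpool(8) first.

```python
def replacement_cmds(pool, role, old_dev, new_dev):
    """Build (not run) the zpool commands to swap a failed SLOG or L2ARC device."""
    if role == "log":
        # A log device is swapped in place.
        return [["zpool", "replace", pool, old_dev, new_dev]]
    if role == "cache":
        # A cache device is dropped and a fresh one added; it holds no unique data.
        return [
            ["zpool", "remove", pool, old_dev],
            ["zpool", "add", pool, "cache", new_dev],
        ]
    raise ValueError(f"unknown role: {role}")


# Placeholder names, for illustration only.
for cmd in replacement_cmds("tank", "cache", "/dev/disk/by-id/old-ssd", "/dev/disk/by-id/new-ssd"):
    print(" ".join(cmd))
```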

The fundamental design ethos of ZFS is "Disks are crap. Cope with it" -
and it does so extremely well.

> On Fri, Dec 18, 2015 at 10:52 PM, Bryn Hughes via zfs-discuss
> <zfs-discuss at list.zfsonlinux.org> wrote:
>     On 2015-12-17 10:44 AM, Ruben Kelevra via zfs-discuss wrote:
>         Well actually a log-device is a spof. When it's plugged out you
>         will remain in a inconsistent storage condition.
>     I don't believe this is a true statement.
>     The log device is never read from under normal operations.  When
>     synchronous writes are called for ZFS writes to the log device,
>     tells the application the data is on stable storage and then
>     proceeds to write to the main disk from RAM just like normal (while
>     meanwhile the application has been able to continue on doing its
>     thing, not having to wait for the data to actually land on the main
>     disk).  The Log device only gets read from in the event of a power
>     failure or crash - when the system starts up again it will check the
>     log device for any uncommitted transactions but ZFS otherwise never
>     reads from it.
>     Bryn
