[zfs-discuss] L2ARC and SLOG on HW RAID with writeback cache

Richard Yao ryao at gentoo.org
Sun Apr 29 14:58:53 EDT 2018

On Apr 29, 2018, at 2:50 PM, Edward Ned Harvey (zfsonlinux) via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:

>> From: zfs-discuss <zfs-discuss-bounces at list.zfsonlinux.org> On Behalf 
>> Of Gandalf Corvotempesta via zfs-discuss
>> This bitrot myth is totally nonsense today
> I have seen both cases - I've seen environments like Gandalf describes, where bitrot simply never occurs, and I've seen environments like Gordon, Steve, Richard, and Durval describe, where it occurs. I've also seen environments where if it occurs, it could result in millions of dollars lost, and environments where if it occurs, nobody cares.
> It certainly is related to the hardware, and related to the price of the hardware, but that's not a pure indicator. You can't just blindly assume expensive SAS hardware will not do it, nor can you assume cheap SATA disks will do it. It partly comes down to manufacturer specifics in specific models of disk and specific factories... It also comes down to climate in the datacenter, cable management within the system chassis (interference and cross-talk) and various other factors.

There is nothing in the hardware to protect against this. A misdirected write (likely caused by vibration) could be detected by reading the sector back afterward, but that has two problems. First, nobody does it, because it hurts performance. Second, even when the mismatch is caught, there is no telling where the data actually landed without stopping the world and scrutinizing everything for hours to work out how to fix it, which nobody does either. It is in no way practical.
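To make the problem concrete, here is a toy sketch (my own illustration, not any real driver API) of why a misdirected write is so hard to repair: a read-after-write verify can tell you the intended sector was not updated, but nothing tells you which other sector was silently clobbered.

```python
# Toy disk: a dict of LBA -> sector contents. All names here are
# illustrative assumptions, not a real storage interface.
disk = {lba: f"old-data-{lba}".encode() for lba in range(8)}

def misdirected_write(intended_lba, data, actual_lba):
    """Simulate a firmware/vibration error: the write lands at the wrong LBA."""
    disk[actual_lba] = data

payload = b"new-data"
misdirected_write(intended_lba=3, data=payload, actual_lba=5)

# A read-after-write verify catches that LBA 3 was not updated...
assert disk[3] != payload

# ...but nothing reveals that LBA 5 was the victim: its old contents
# are gone, and no check at write time ever looked at it.
assert disk[5] == payload
```

Finding the victim sector after the fact means scanning and cross-checking the whole disk, which is exactly the stop-the-world scrutiny described above.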

That is before we even get to the case people usually picture when they hear "bitrot": sectors being damaged and going bad. That is the one case traditional RAID is able to handle, but it is by no means the only issue, or even the most common one.
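The reason a checksumming filesystem catches cases RAID misses is that the checksum lives with the block *pointer*, not inside the block itself, so a stale, damaged, or misplaced block cannot validate itself. A minimal sketch of that idea (my own simplification of the ZFS approach, using assumed helper names):

```python
import hashlib

def write_block(storage, lba, data):
    """Write a block and return a 'parent' pointer holding its checksum."""
    storage[lba] = data
    return {"lba": lba, "checksum": hashlib.sha256(data).digest()}

def read_block(storage, ptr):
    """Read a block and verify it against the checksum stored in the parent."""
    data = storage[ptr["lba"]]
    if hashlib.sha256(data).digest() != ptr["checksum"]:
        raise IOError(f"checksum mismatch at LBA {ptr['lba']}")
    return data

storage = {}
ptr = write_block(storage, 7, b"important data")
assert read_block(storage, ptr) == b"important data"

# Simulate bitrot: flip one bit in the stored sector.
corrupted = bytearray(storage[7])
corrupted[0] ^= 0x01
storage[7] = bytes(corrupted)

try:
    read_block(storage, ptr)
except IOError as e:
    print(e)  # corruption detected on read
```

Parity RAID only reconstructs data when a drive reports an error; a scheme like the above detects silent corruption on every read, regardless of what the drive claims.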

> There's no way to have an absolute guarantee (if you buy this type of hardware you won't be affected) so the easiest and cheapest thing to do is simply use filesystems that provide data integrity. Poof, problem solved.

Show me mechanical storage hardware and I can guarantee that I can find a way for something to go wrong with it.

> To emphasize this point (you can't just assume because of the hardware) search for intel errata. Even in ubiquitous enterprise standard hardware, errors occur, and manufacturing flaws get designed in. Not to mention manufacturing imperfections. I once had a CPU where one instruction (a single instruction, related to multitasking) was flawed. So the CPU passed all diagnostics, and could run the OS installer (which was single threaded), but still could not boot the OS, the system crashed every time it tried to start the first multi-tasked process in the system. And I've seen other hardware that would do weird shit like this... But only sometimes. Called "flaky" hardware. Enterprise or commodity, it can happen to them all, but less often on the enterprise. It's just a random probability distribution.

To add to this, ZFS has caught corruption caused by disk controllers.
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at list.zfsonlinux.org
> http://list.zfsonlinux.org/cgi-bin/mailman/listinfo/zfs-discuss
