[zfs-discuss] L2ARC and SLOG on HW RAID with writeback cache

Richard Elling richard.elling at richardelling.com
Sun Apr 29 18:11:30 EDT 2018


> On Apr 29, 2018, at 12:48 PM, Richard Yao via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:
> 
> On Apr 29, 2018, at 2:58 PM, Richard Yao via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:
> 
>> On Apr 29, 2018, at 2:50 PM, Edward Ned Harvey (zfsonlinux) via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:
>> 
>>>> From: zfs-discuss <zfs-discuss-bounces at list.zfsonlinux.org> On Behalf 
>>>> Of Gandalf Corvotempesta via zfs-discuss
>>>> 
>>>> This bitrot myth is totally nonsense today
>>> 
>>> I have seen both cases - I've seen environments like Gandalf describes, where bitrot simply never occurs, and I've seen environments like Gordon, Steve, Richard, and Durval describe, where it occurs. I've also seen environments where if it occurs, it could result in millions of dollars lost, and environments where if it occurs, nobody cares.
>>> 
>>> It certainly is related to the hardware, and related to the price of the hardware, but that's not a pure indicator. You can't just blindly assume expensive SAS hardware will not do it, nor can you assume cheap SATA disks will do it. It partly comes down to manufacturer specifics in specific models of disk and specific factories... It also comes down to climate in the datacenter, cable management within the system chassis (interference and cross-talk) and various other factors.
>> 
>> There is nothing in the hardware to protect against this. A misdirected write (likely caused by vibration) could be detected if a read is done afterward, but that has two problems. The first is that nobody does it because it hurts performance. The second is that there is no telling where the write went without stopping the world and scrutinizing everything (for several hours) and trying to make sense of how to fix it, which nobody does. It is in no way practical.
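To put the performance cost in concrete terms, a toy write-with-verify in Python looks something like this (the names are invented for illustration; a real version would also need O_DIRECT so the read-back actually hits the media instead of the page cache):

    import hashlib
    import os

    def write_and_verify(path, offset, data):
        # Write the block, force it to the device, then read it back and
        # compare checksums.  The read-back is the "verify" step: it roughly
        # doubles the work per write, which is why nothing enables it by default.
        with open(path, "r+b") as f:
            f.seek(offset)
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
            f.seek(offset)
            readback = f.read(len(data))
        return hashlib.sha256(readback).digest() == hashlib.sha256(data).digest()

Even this toy version issues two I/Os per logical write, and on a spinning disk the verify costs at least one extra rotation on top of that.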
> 
> Just to add to my remark, this video demonstrates vibrations can cause misdirected IOs:
> 
> https://youtu.be/tDacjrSCeq4
> 
> In that example, the vibrations are caused by yelling, but vibrations can come from anywhere, including other drives.

Actually, that video demonstrates how HDDs protect themselves from misdirected writes 
caused by vibration. It is this protection that shows itself as latency, because the disk must
rotate again before the write can be successfully completed.
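Some rough numbers (assumed, not taken from the video) show why that protection is so visible as latency: at 7200 RPM a single missed revolution costs more than the write service time itself.

    rpm = 7200
    rotation_ms = 60_000 / rpm                      # ~8.3 ms per full rotation
    base_service_ms = 4.0                           # assumed average write service time
    print(1000 / base_service_ms)                   # ~250 IOPS with no retries
    print(1000 / (base_service_ms + rotation_ms))   # ~81 IOPS if every write needs a retry

One extra revolution per affected write is enough to cut throughput by more than half, which is why the effect shows up so dramatically in the latency graphs.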

> 
> The IOPS drop because reading blocks from the wrong place is easy for the drive to detect and correct by reissuing the read. Misdirected writes would go undetected, but the detection of misdirected reads demonstrates that misdirected IOs occur. A write is just as capable of being thrown off track as a read is.

Misdirected reads or writes are quite rare in modern storage devices, even those that do 
garbage collection (SMR HDDs and flash SSDs). The majority of failures are in the media and the firmware.

>> 
>> That is not even talking about the case that more commonly occurs to people when they hear bitrot, which is sectors being damaged and going bad. That is the one case that traditional RAID is able to handle, but it is by no means the only issue, or the most common issue.
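The difference is easy to see with a toy two-way-mirror read (a sketch of the idea, not how any real RAID stack is implemented): a reported bad sector tells the array exactly which copy to discard, while a silent flip leaves two disagreeing copies and, without a checksum, no way to pick the good one.

    import hashlib

    def read_mirror(copy_a, copy_b, checksum=None):
        # None stands for "the drive reported a read error" -- the easy case,
        # because the array knows exactly which copy to discard.
        if copy_a is None:
            return copy_b
        if copy_b is None:
            return copy_a
        if copy_a == copy_b:
            return copy_a
        # Silent corruption: both drives returned data, but the copies differ.
        if checksum is None:
            raise RuntimeError("copies disagree and there is no checksum to arbitrate")
        for copy in (copy_a, copy_b):
            if hashlib.sha256(copy).digest() == checksum:
                return copy
        raise RuntimeError("both copies fail the checksum")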
>> 
>>> There's no way to have an absolute guarantee (if you buy this type of hardware you won't be affected) so the easiest and cheapest thing to do is simply use filesystems that provide data integrity. Poof, problem solved.
>> 
>> Show me mechanical storage hardware and I can guarantee that I can find a way for something to go wrong with it.
>> 
>>> To emphasize this point (you can't just assume it based on the hardware), search for Intel errata. Even in ubiquitous enterprise-standard hardware, errors occur and manufacturing flaws get designed in, not to mention manufacturing imperfections. I once had a CPU where one instruction (a single instruction, related to multitasking) was flawed. The CPU passed all diagnostics and could run the OS installer (which was single-threaded), but still could not boot the OS; the system crashed every time it tried to start the first multitasked process. And I've seen other hardware that would do weird shit like this... but only sometimes. That's what gets called "flaky" hardware. Enterprise or commodity, it can happen to them all, just less often on the enterprise side. It's a random probability distribution.
>> 
>> To add to this, ZFS has caught corruption caused by disk controllers.

Yes, and also firmware bugs in FC switches, which T10 DIF will catch :-) End-to-end checksumming
is important, and we see it moving up the stack.
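The idea fits in a few lines of Python (a toy store of my own, not the ZFS or T10 DIF on-disk format): keep the checksum with the reference to the data rather than beside the data, and verify it on every read.

    import hashlib
    import json
    import os

    def put(store_dir, name, data):
        with open(os.path.join(store_dir, name), "wb") as f:
            f.write(data)
        # The checksum lives with the "pointer" to the data, not beside the
        # data itself, so a misdirected or torn write of the data block cannot
        # also fix up its own checksum.
        with open(os.path.join(store_dir, name + ".sum"), "w") as f:
            json.dump({"sha256": hashlib.sha256(data).hexdigest()}, f)

    def get(store_dir, name):
        with open(os.path.join(store_dir, name), "rb") as f:
            data = f.read()
        with open(os.path.join(store_dir, name + ".sum")) as f:
            expected = json.load(f)["sha256"]
        if hashlib.sha256(data).hexdigest() != expected:
            raise IOError("checksum mismatch: corruption detected end to end")
        return data

Anything that corrupts the data anywhere between the writer and the reader -- controller, switch, firmware, or the disk itself -- shows up as a mismatch on read, which is the whole point of doing the check at the top of the stack.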
 -- richard
