[zfs-discuss] ZOL + SATA + Interposers + SAS Expanders is bad Mkay
luka.morris at aol.com
Mon Oct 14 19:48:21 EDT 2013
Dead Horse wrote:
History has a way of repeating itself, this time with ZFS on Linux instead of Solaris. In a nutshell, using SATA drives + interposers with a SAS expander + ZOL ends up in all kinds of fun and games. I have spent a few hours here and there on a couple of lab servers this past week (J4400 array + Seagate Moose drives + LSI SAS 9207-8E). I looked at some ways to try and mitigate or fix this issue with ZOL. However, digging under the hood shows that this not-so-new issue is actually far more sinister under Linux than it was with Solaris.
Read on for the gory details. (log file attached for the curious as well)
This issue is fundamentally caused by the SATA --> SAS protocol translation via interposers to a SAS expander. The irony here is that it takes ZFS to supply the IO load needed to trip it up easily. Using MD+LVM or HW RAID + LVM with this type of setup was never able to create the perfect storm. I should also note that I did try BTRFS on this setup, and like ZFS it was also able to bring the issue out.
Basically what is happening here is this: a hardware error occurs (be it a read error or a drive overwhelmed with cache-sync commands) and a device or bus reset is issued. What happens next is that the single reset turns into *MANY* resets. The initial reset on the SAS expander causes all in-progress IO operations to abort; it should then generate a proper SAS hardware error/sense code. Here is where the problem starts. The interposers lose things in the SATA --> SAS protocol translation and instead return a generic hardware error/sense code. The Linux (sg) driver then steps in and tries to be helpful by issuing another reset in an effort to right the ship. With a lot of IO in flight, e.g. the kind of IO that ZFS generates on a disk subsystem like this, needless to say the ship never rights itself and instead makes a quick trip to Davey Jones...
I don't have really extensive experience with interposers, but it seems to me interposers are more likely to do the correct thing than the drives' firmware, probably because there are just a few interposer models in the world (instead of changing every few months like the drives) and they are aimed at the enterprise storage business, so they are probably well tested. When I added interposers to a storage system of ours, things improved compared to no interposers (the expander we had could work both ways).
I have the feeling your experience might be another version of "drive X incompatible with interposer Y and controller Z", of which there are many even without interposers; that's why you see hardware compatibility lists around. Admittedly it's difficult to find an HCL that also includes interposers in the middle.
Another possibility is "drive X is faulty". Suppose a drive responds with a generic error code when a reset is issued, and in particular does that while it is still trying to read a defective sector. Considering the Moose drives are not enterprise-class, I expect them not to have ERC (error recovery control), and hence the max time to recover a sector is usually around 2 minutes.
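If you want to verify, smartctl can query SCT Error Recovery Control; on non-enterprise drives it usually reports the feature as unsupported. A minimal sketch (/dev/sdX is a placeholder, and behind the interposers an explicit -d type may be needed):

  # show the current read/write ERC timeouts, or "not supported"
  smartctl -l scterc /dev/sdX

  # if supported, set enterprise-style 7-second limits (values are in
  # tenths of a second and do not survive a power cycle)
  smartctl -l scterc,70,70 /dev/sdX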
Now, at the first defective sector (or some other kind of error that might generate the first reset), the default 30-second Linux SCSI timeout is reached and a reset is issued --> drive responds with a generic fault --> Linux responds with another reset --> drive is still trying to recover the sector so it responds with another generic fault... and so on.
The firmware of the drive might even "screw up" completely in such an unlikely situation, which is probably scarcely tested at Seagate labs (because it's a desktop-class drive). If the firmware switches to an inconsistent state it might keep responding nonsense until it is powered off and on again. In fact I had a WD drive that sporadically started responding badly and kept doing so until the next power-off; no kind of reset would fix it, while the other drives of the same model never misbehaved in the first place. Simply replacing that one fixed the problem, i.e. it was not the cabling.
Now I hate Seagates so it must be their fault for sure :-)
Can you check all drives with " smartctl -x /dev/sd... ", looking especially at the two sections named:
"SMART Extended Comprehensive Error Log"
"SATA Phy Event Counters (GP Log 0x11)"
to see whether one drive is significantly different from the others in those two sections, i.e. showing more errors somehow, which can point to the culprit? A quick loop like the sketch below can collect these for comparison.
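A minimal sketch (assuming the drives show up as /dev/sda, /dev/sdb, ...; behind the interposers smartctl may need an explicit -d type):

  for d in /dev/sd[a-z]; do
      echo "=== $d ==="
      # print just the two sections of interest from the extended SMART output
      smartctl -x "$d" | sed -n '/SMART Extended Comprehensive Error Log/,/^$/p; /SATA Phy Event Counters/,/^$/p'
  done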
Also, can you run a SMART long test on all drives to check that the surface is good on each of them? Note, however, that this cannot completely rule out a firmware or electronics bug on a specific disk. (Commands below.)
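For reference, starting the test and reading the result back looks like this (a sketch; /dev/sdX is a placeholder):

  smartctl -t long /dev/sdX        # kick off the extended self-test (runs inside the drive)
  smartctl -l selftest /dev/sdX    # check progress and the result later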
You might also want to raise the SCSI-layer timeout for all drives to a very large value such as 86400 (= 24 hours), or at least much higher than the human intervention time on that storage system, so that next time you could catch the thing happening live, stuck before the first reset.
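On Linux that timeout lives in sysfs and is set per device, and it does not survive a reboot. A sketch, assuming the disks appear as sd* devices:

  # show the current timeout in seconds for one drive
  cat /sys/block/sdX/device/timeout

  # raise it to 24 hours on all sd devices
  for t in /sys/block/sd*/device/timeout; do
      echo 86400 > "$t"
  done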
Can you determine, from dmesg or /var/log/dmesg, the first drive that gave errors and caused the first reset?
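Something like this can help find the earliest offender (just a rough filter; the exact message text varies between kernel versions):

  dmesg | grep -iE 'reset|sense|hardware error|I/O error' | head -n 40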
Can you confirm NCQ is disabled on those drives? I read at http://en.wikipedia.org/wiki/Seagate_Barracuda that NCQ behaviour is buggy on Moose drives, and at https://ata.wiki.kernel.org/index.php/Known_issues#Seagate_harddrives_which_time_out_FLUSH_CACHE_when_NCQ_is_being_used that Linux should automatically disable NCQ on those drives, but it's better to check. Anyway, from the same Wikipedia page it seems Moose drives are deeply buggy, so I wouldn't be surprised if they turn out to be unsuitable for RAID/ZFS, but a workaround might be possible.
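A quick place to look is the per-device queue depth, since with queueing disabled it is effectively 1 (a sketch; sdX is a placeholder):

  # a queue depth of 1 effectively means no command queueing on that device
  for d in /sys/block/sd*/device/queue_depth; do
      echo "$d: $(cat "$d")"
  done

  # force it to 1 on a given drive if it is not already
  echo 1 > /sys/block/sdX/device/queue_depth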
One thing I don't understand in your story. You write:
"The initial reset on the SAS expander causes all in-progress IO operations to abort. It should then generate a proper SAS hardware error/sense code. Here is where the problem starts. The interposers instead lose things in protocol translation from SATA --> SAS and instead return a generic hardware error/sense code. "
I thought an expander reset should not reach the drives; it should stop before them. Or should it?