[zfs-discuss] ZOL + SATA + Interposers + SAS Expanders is bad Mkay

Luka Morris luka.morris at aol.com
Mon Oct 14 19:48:21 EDT 2013

Dead Horse wrote:
History has a way of repeating itself, this time with ZFS on Linux instead of Solaris. In a nutshell, using SATA drives + interposers with a SAS expander + ZOL ends up in all kinds of fun and games. I have spent a few hours here and there on a couple of lab servers this past week (J4400 Array + Seagate Moose Drives + LSI SAS 9207-8E). I looked at some ways to try and mitigate or fix this issue with ZOL. However, digging under the hood with this not-so-new issue + ZOL shows that it is actually far more sinister under Linux than it was with Solaris.
Read on for the gory details. (Log file attached for the curious as well.)
This issue is fundamentally caused by the SATA --> SAS protocol translation via interposers to a SAS expander. The irony here is that it takes ZFS to supply the IO loading needed to trip it up easily. Using MD+LVM or HW RAID + LVM with this type of setup was never able to create the perfect storm. However, I should also note that I did try using BTRFS on this setup and, like ZFS, it was also able to bring this issue out.
Basically what is happening here is that when a hardware error occurs, be it a read error or a drive overwhelmed with cache-sync commands, a device or bus reset is issued. What happens next is that the single reset turns into *MANY* resets. The initial reset on the SAS expander causes all in-progress IO operations to abort. It should then generate a proper SAS hardware error/sense code. Here is where the problem starts. The interposers instead lose things in protocol translation from SATA --> SAS and return a generic hardware error/sense code. Now the Linux (sg) driver steps in and tries to be helpful by issuing another reset in an effort to right the ship. Given this scenario, if you have a lot of IO going on, e.g. the kind of IO that ZFS can generate with a disk subsystem like this, needless to say the ship never rights itself and instead makes a quick trip to Davy Jones...
    I don't have really extensive experience with interposers, but it seems to me interposers are more likely to do the correct thing than the drives' firmware. Probably because there are just a few models of interposers all over the world, instead of changing every few months like the drives, and they are aimed at the enterprise storage business so they are probably well tested. When I added interposers to one of our storage systems, things improved compared to no interposers (the expander we had could work both ways).
    I have the feeling your experience might be another version of "drive X incompatible with interposer Y and controller Z", of which there are many, even without interposers. That's why you see hardware compatibility lists around. Admittedly it's difficult to find an HCL that also includes interposers in the middle.
    Another possibility is "drive X is faulty". Suppose a drive responds with a generic error code when a reset is issued, and in particular does that if it is still trying to read a defective sector. Considering the Moose drives are not enterprise drives, I expect they lack ERC, so the maximum time to recover a sector is usually around 2 minutes.
    Now at the first defective sector (or some other kind of error that might generate the first reset), the 30-second Linux default SCSI timeout is reached, a reset is issued --> drive responds with generic fault --> Linux responds with another reset --> drive is still trying to recover the sector so responds with another generic fault... and so on.
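    The arithmetic behind that loop can be sketched with a toy model. This is only an illustration under stated assumptions: the drive needs a fixed time to recover one sector, a reset does not interrupt that recovery, and the command timer simply expires again every timeout period. The function name is hypothetical and the real kernel error handler is more involved.

```python
def resets_before_recovery(recovery_s: int, timeout_s: int) -> int:
    """Count command timeouts that expire while the drive is still busy.

    Toy model (assumption, not kernel behaviour): a timeout fires at
    timeout_s, 2 * timeout_s, ... and each one that lands strictly
    before the drive finishes its recovery triggers one reset.
    """
    if timeout_s <= 0:
        raise ValueError("timeout_s must be positive")
    resets = 0
    elapsed = timeout_s
    while elapsed < recovery_s:
        resets += 1
        elapsed += timeout_s
    return resets


if __name__ == "__main__":
    # ~2 minutes of sector recovery against the 30-second default timeout.
    print(resets_before_recovery(120, 30))
    # The same recovery against a 24-hour timeout.
    print(resets_before_recovery(120, 86400))
```

    In this model a timeout longer than the drive's recovery time produces no resets at all, which is why the ratio between the two numbers matters more than either value alone.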
    The firmware of the drive might even "screw up" completely in such a situation, which is unlikely for a desktop-class drive and therefore probably scarcely tested at Seagate labs. If the firmware switches to an inconsistent state it might keep responding with nonsense until it is powered off and on again. In fact I had a WD drive that sporadically started responding badly and kept doing so until the next power-off; no kind of reset would fix it, while the other drives of the same model never misbehaved in the first place. Simply replacing that one drive fixed the problem, i.e. it was not the cabling.
    Now I hate Seagates so it must be their fault for sure :-)
    Can you check all drives with " smartctl -x /dev/sd... ", looking especially at the two sections named:
    "SMART Extended Comprehensive Error Log"
    "SATA Phy Event Counters (GP Log 0x11)"
    to see whether one drive differs significantly from the others in those two sections, for example by showing more errors, which could indicate the culprit.
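    A rough sketch of how that comparison could be scripted. It assumes the smartctl text output contains a "Device Error Count: N" line in the error log section and rows of the form "0x0001  2  0  Description" under the Phy Event Counters header; the function names are hypothetical, and drives behind some HBAs may need extra smartctl options that are not shown here.

```python
import re
import subprocess

# Assumed line shapes in `smartctl -x` text output.
ERROR_COUNT_RE = re.compile(r"^Device Error Count:\s*(\d+)", re.MULTILINE)
PHY_ROW_RE = re.compile(
    r"^0x[0-9a-fA-F]{4}\s+\d+\s+(\d+)\+?\s+(.+)$", re.MULTILINE
)
PHY_HEADER = "SATA Phy Event Counters"


def summarize_smart(text: str) -> dict:
    """Extract the error count and Phy event counters from smartctl text."""
    count = ERROR_COUNT_RE.search(text)
    start = text.find(PHY_HEADER)
    # Only parse counter rows that come after the Phy section header.
    phy_text = text[start:] if start != -1 else ""
    phy = {desc.strip(): int(value)
           for value, desc in PHY_ROW_RE.findall(phy_text)}
    return {
        # None means no count line was found (no errors logged, or the
        # log is not supported on this drive).
        "error_count": int(count.group(1)) if count else None,
        "phy_events": phy,
    }


def collect(devices):
    """Run `smartctl -x` on each device (needs root and smartmontools)."""
    results = {}
    for dev in devices:
        # smartctl uses a bitmask exit status, so it is not checked here.
        proc = subprocess.run(["smartctl", "-x", dev],
                              capture_output=True, text=True)
        results[dev] = summarize_smart(proc.stdout)
    return results
```

    Calling collect() with the list of member disks and printing the result side by side should make an outlier drive easy to spot.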
    Also, can you run a SMART long test on all drives to check that the surface is good on each of them? Note, however, that this cannot completely rule out a firmware or electronics bug that only shows up on a specific disk.
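    For reference, a minimal sketch of starting the long test on several drives. It uses the smartctl options `-t long` (start the extended self-test) and `-l selftest` (read the self-test log afterwards); the helper names and the injectable `run` parameter are illustrative choices.

```python
import subprocess


def long_test_command(device: str) -> list:
    """Command that starts an extended (long) SMART self-test."""
    return ["smartctl", "-t", "long", device]


def selftest_log_command(device: str) -> list:
    """Command that prints the self-test log once the test has finished."""
    return ["smartctl", "-l", "selftest", device]


def start_long_tests(devices, run=subprocess.run):
    """Start the long test on every device (needs root).

    The test runs inside the drive, so this returns right away; the
    results have to be read later with selftest_log_command().
    """
    for dev in devices:
        run(long_test_command(dev), check=False)
```

    A long test can take hours on large drives, so it is worth starting them all at once and checking the logs the next day.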
    You might also want to raise the SCSI layer timeout for all drives to a very large value such as 86400 seconds (24 hours), or at any rate much longer than the human intervention time on that storage system, so that next time you might catch the problem live, with the drive stuck before the first reset.
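    A minimal sketch of raising the timeout, assuming the per-device value is exposed at /sys/block/<disk>/device/timeout as on typical Linux systems. The function name and the `sys_block` parameter are hypothetical; the parameter exists only so the code can be tried against a scratch directory.

```python
from pathlib import Path


def set_scsi_timeout(disk: str, seconds: int,
                     sys_block: str = "/sys/block") -> Path:
    """Write a new SCSI command timeout (in seconds) for one disk.

    `disk` is a kernel name such as "sdb". Needs root on a real system.
    """
    path = Path(sys_block) / disk / "device" / "timeout"
    path.write_text(f"{seconds}\n")
    return path


if __name__ == "__main__":
    # Hypothetical disk names; replace with the pool's member disks.
    for name in ("sdb", "sdc"):
        set_scsi_timeout(name, 86400)
```

    The setting lives in sysfs, so it is lost on reboot or when the disk is re-attached and would have to be reapplied.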
    Can you determine from dmesg or /var/log/dmesg which drive was the first to report errors and cause a reset?
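    One way to scan a saved kernel log for that first drive, as a sketch only. It assumes disk names appear as "[sdX]" or "dev sdX" in the relevant messages and that the lines of interest contain one of a few keywords; both the patterns and the keyword list are guesses that would need adjusting to the actual log.

```python
import re

# Assumed ways a disk name shows up in kernel messages.
DEVICE_RE = re.compile(r"\[(sd[a-z]+)\]|\bdev (sd[a-z]+)\b")
KEYWORDS = ("error", "reset", "abort", "timeout")


def first_failing_device(lines, keywords=KEYWORDS):
    """Return (disk, line) for the first log line that names a disk and
    contains one of the keywords, or None if nothing matches."""
    for line in lines:
        lowered = line.lower()
        if not any(word in lowered for word in keywords):
            continue
        match = DEVICE_RE.search(line)
        if match:
            return match.group(1) or match.group(2), line.rstrip("\n")
    return None


if __name__ == "__main__":
    import sys
    # Usage: dmesg | python3 this_script.py
    print(first_failing_device(sys.stdin))
```

    Messages from the HBA driver that identify a drive only by SCSI address or SAS address would be missed by this and need to be mapped to a disk by hand.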
    Can you confirm NCQ is disabled on those drives? I read here http://en.wikipedia.org/wiki/Seagate_Barracuda that NCQ behaviour is buggy on Moose drives, and here https://ata.wiki.kernel.org/index.php/Known_issues#Seagate_harddrives_which_time_out_FLUSH_CACHE_when_NCQ_is_being_used that Linux should automatically disable NCQ on those drives, but it's better to check. Anyway, from http://en.wikipedia.org/wiki/Seagate_Barracuda it seems Moose drives are deeply buggy, so I wouldn't be surprised if they turn out to be unsuitable for RAID/ZFS, though a workaround might be possible.
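    A quick way to check this, sketched under the assumption that the queue depth is readable at /sys/block/<disk>/device/queue_depth and that a depth of 1 means command queuing is effectively off. Behind a SAS HBA the value is set by the HBA driver rather than libata, so treat the result as a hint. The function names and the `sys_block` parameter are hypothetical.

```python
from pathlib import Path


def queue_depths(disks, sys_block: str = "/sys/block") -> dict:
    """Read the current queue depth of each disk from sysfs."""
    depths = {}
    for disk in disks:
        path = Path(sys_block) / disk / "device" / "queue_depth"
        depths[disk] = int(path.read_text().strip())
    return depths


def queuing_active(depth: int) -> bool:
    """A depth above 1 means more than one command may be outstanding."""
    return depth > 1


if __name__ == "__main__":
    # Hypothetical disk names; replace with the pool's member disks.
    for name, depth in queue_depths(["sdb", "sdc"]).items():
        print(name, depth, "queuing on" if queuing_active(depth) else "queuing off")
```

    If queuing turns out to be on, writing 1 to the same queue_depth file is the usual way to switch it off for a test run.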
    There is one thing I don't understand in your story. You write:
    "The initial reset on the SAS expander causes all in-progress IO operations to abort. It should then generate a proper SAS hardware error/sense code. Here is where the problem starts. The interposers instead lose things in protocol translation from SATA --> SAS and instead return a generic hardware error/sense code. "
    I thought an expander reset should not reach the drives; it should stop before them. Or should it?
