[zfs-discuss] ZOL + SATA + Interposers + SAS Expanders is bad Mkay

Dead Horse deadhorseconsulting at gmail.com
Fri Oct 11 16:55:49 EDT 2013

The irony here is that the particular HW setup where I was most easily able
to bring this issue out and debug it was, in essence, a "fishworks" appliance:
ST31000340 drives behind LSI SS1320 interposers located within a stack of
J4400s hooked up to an LSI SAS 9207-E in an x4170M2 server. Interestingly, I
found that things got worse with Active-Active multipathing, a bit less bad
with Active-Passive, and less still with no multipathing, but the problem was
nonetheless ever present with this setup. Running the same on the exact same
setup, except replacing the Moose drives with some NL-SAS drives, yielded a nicely working

I did a bit of chatting offline with Brian about this. He mentioned that
LLNL too had run into this, and that they avoided it by not using expanders
and by using NL-SAS drives.

He also noted they did take a look at making some improvements in the
error handling within the Linux SCSI mid-layer, but did not have
development time to spare on it.

The relevant error handling code is actually in drivers/scsi/scsi_error.c.
A thread is created for each attached SCSI host for error handling.
The scsi_error_handler() function contains the main loop.
For each host it can use either:
 scsi_unjam_host() --> (generic recovery code)
or the host can register its own error recovery handler.
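
To make the dispatch described above a bit more concrete, here is a small
user-space model of it. This is only a sketch and not kernel code: every name
prefixed with model_ is hypothetical, the struct is a heavily simplified
stand-in for the real host structure, and the hook name eh_strategy_handler
is what I believe the kernel calls it, so treat that as an assumption and
check scsi_error.c for your kernel version.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space model; NOT the kernel's real structures. */
struct model_host;
typedef void (*model_eh_fn)(struct model_host *);

struct model_host {
    const char *name;
    model_eh_fn eh_strategy_handler; /* NULL => fall back to generic recovery */
    int failed_cmds;                 /* commands waiting for error handling */
    int generic_runs;                /* times the generic path ran */
    int custom_runs;                 /* times a driver-specific handler ran */
};

/* Stand-in for scsi_unjam_host(): the generic recovery path. */
static void model_unjam_host(struct model_host *shost)
{
    shost->generic_runs++;
    shost->failed_cmds = 0; /* pretend every failed command was recovered */
}

/* Hypothetical driver-specific recovery handler. */
static void model_custom_handler(struct model_host *shost)
{
    shost->custom_runs++;
    shost->failed_cmds = 0;
}

/* One pass of the per-host error handler loop. */
static void model_error_handler_pass(struct model_host *shost)
{
    assert(shost != NULL);

    if (shost->failed_cmds == 0)
        return; /* nothing to recover; the real thread would sleep here */

    if (shost->eh_strategy_handler)
        shost->eh_strategy_handler(shost); /* driver registered its own */
    else
        model_unjam_host(shost);           /* generic recovery code */
}
```

A host set up with eh_strategy_handler left NULL ends up in
model_unjam_host(), while one that points it at model_custom_handler
bypasses the generic path entirely. That is why a change to the generic
recovery code would affect any driver that does not register its own handler.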

I checked, and the mpt2sas driver (in my case) uses scsi_unjam_host().

This is where a "workaround" would have to go much like what was done in

Andrew, nonetheless I do agree with you that a fix or "workaround" in the
kernel is not a true fix; thus IMHO it is best to avoid the nightmare in the
first place. I put this out there as a warning to any unfortunate soul using
ZOL and thinking about using this type of HW setup (or for those using it
already and wondering WTF is going on with their setup).


