[zfs-discuss] zfs large pool - many checksum errors, no read or write errors

Badi' Abdul-Wahid abdulwahidc at gmail.com
Mon May 16 19:27:05 EDT 2016


On Mon, May 16, 2016 at 5:16 PM, Alex Braunegg <alex.braunegg at gmail.com>
wrote:

> Hi,
>
>
>
> I get something highly similar with my test HP N40L, however no checksum
> errors, read or write errors:
>
>
>
> Controllers:
>
>
>
> 00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD/ATI]
> SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] (rev 40)
>
> 02:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9230 PCIe SATA
> 6Gb/s Controller (rev 10)
>
>
>
> Disk Layout:
>
>
>
> H/W path      Device      Class      Description
> ================================================
> /0/1/0.0.0    /dev/sda    disk       1TB ST31000340NS
> /0/2/0.0.0    /dev/sdb    disk       1TB ST31000340NS
> /0/4/0.0.0    /dev/sdc    disk       1TB ST31000340NS
> /0/5/0.0.0    /dev/sdd    disk       1TB ST31000340NS
> /0/6/0.0.0    /dev/sde    disk       320GB TOSHIBA MQ01ABD0
> /0/7/0.0.0    /dev/sdf    disk       3TB WDC WD30EFRX-68A
> /0/8/0.0.0    /dev/sdg    disk       320GB TOSHIBA MQ01ABD0
>
>
>
> Disks 1, 2, 4 and 5 are connected to the onboard AMD SATA controller; 6, 7
> and 8 to the Marvell. The 2 x 320GB’s are in a software RAID mirror for boot.
>
>
>
> The errors I periodically get are:
>
>
>
> ata5: hard resetting link
>
> ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>
> ata7: hard resetting link
>
> ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>
> ata7: hard resetting link
>
> ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>
Do your disks start off with a 6 Gbps SATA link then fall back to 3 Gbps
after a while?
I'd be curious to know what smartctl reports your SATA version to be before
and after for these disks.
I get this:
Before: SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
After: SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
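
In case it helps with the comparison, here is a rough sketch of how the
"current" speed could be pulled out of the smartctl output for every disk.
The function names are made up for illustration, and I am assuming
`smartctl -i` prints the "SATA Version is:" line in the same form as above;
adjust the glob if your disks are not listed under /dev/disk/by-id/ata-*.

```shell
# Print the negotiated ("current") SATA link speed found in `smartctl -i`
# output read from stdin. Prints nothing if there is no "(current: ...)" part.
current_sata_speed() {
    sed -n 's/^SATA Version is:.*(current: \(.*\))[[:space:]]*$/\1/p'
}

# Hypothetical helper: report the current link speed for each whole disk.
# Needs root and smartmontools; defined here but not called automatically.
report_link_speeds() {
    for dev in /dev/disk/by-id/ata-*; do
        # Skip partition entries such as ...-part1
        case "$dev" in *-part*) continue ;; esac
        printf '%s: %s\n' "$dev" "$(smartctl -i "$dev" | current_sata_speed)"
    done
}
```

Running report_link_speeds once after boot and again after a scrub should
show whether any link dropped from 6.0 to 3.0 Gb/s.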


>
> I had originally put it down to the data bus being busy as it always
> seemed to coincide with when smart tests were running and the tail end of a
> weekly pool scrub.
>
For me it is always at the beginning of the scrub.
I can also induce it by running:
dd if=/dev/disk/by-id/ata-WDC_WD2001FFSX-68JNUN0_WD-WMC5C0D2RSF3 of=/dev/null bs=64 count=96
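
To line the resets up against a scrub or a dd run, something like the
following could list each renegotiation from the kernel log. This is only a
sketch: link_up_events is a name I made up, and it assumes the log lines
have the "ataN: SATA link up X Gbps" form quoted above.

```shell
# Read kernel log lines on stdin and print one "<port> <speed>" pair
# (for example "ata7 3.0") for every "SATA link up" message.
link_up_events() {
    sed -n 's/.*\(ata[0-9][0-9]*\): SATA link up \([0-9.]*\) Gbps.*/\1 \2/p'
}

# Example usage (not run automatically):
#   dmesg | link_up_events
```

A port that first reports 6.0 and later 3.0 is one where the link was
downgraded.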

>
>
>
>
> Best regards,
>
>
>
> Alex
>
>
>
>
>
>
>
> *From:* zfs-discuss [mailto:zfs-discuss-bounces at list.zfsonlinux.org] *On
> Behalf Of *Badi' Abdul-Wahid via zfs-discuss
> *Sent:* Tuesday, 17 May 2016 6:49 AM
> *To:* zfs-discuss at list.zfsonlinux.org
> *Subject:* Re: [zfs-discuss] zfs large pool - many checksum errors, no
> read or write errors
>
>
>
> Francois, here is an instance of the errors I get.
>
>
>
>
>
> May 15 07:15:02 namo kernel: blk_update_request: I/O error, dev sda, sector 1930956344
> May 15 07:15:02 namo kernel: ata1: EH complete
> May 15 07:16:18 namo kernel: ata1.00: exception Emask 0x10 SAct 0x10000 SErr 0x280100 action 0x6 frozen
> May 15 07:16:18 namo kernel: ata1.00: irq_stat 0x08000000, interface fatal error
> May 15 07:16:18 namo kernel: ata1: SError: { UnrecovData 10B8B BadCRC }
> May 15 07:16:18 namo kernel: ata1.00: failed command: READ FPDMA QUEUED
> May 15 07:16:18 namo kernel: ata1.00: cmd 60/00:80:c8:58:23/01:00:74:00:00/40 tag 16 ncq 131072 in
>          res 40/00:84:c8:58:23/00:00:74:00:00/40 Emask 0x10 (ATA bus error)
> May 15 07:16:18 namo kernel: ata1.00: status: { DRDY }
> May 15 07:16:18 namo kernel: ata1: hard resetting link
>
>
>
> I see several instances of these until
>
> ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
>
> after which the errors subside.
>
>
>
> Looking at smartctl for the disk:
>
> ata-WDC_WD3001FFSX-68JNUN0_WD-WMC1F0E426P7 SATA Version is:  SATA 3.0, 6.0
> Gb/s (current: 3.0 Gb/s)
>
>
>
>
>
>
>
> On Mon, May 16, 2016 at 12:36 PM, Badi' Abdul-Wahid <abdulwahidc at gmail.com>
> wrote:
>
> Francois Stark <francois at postmasters.co.za> writes:
>
> > ... Thanks for the suggestion - but wouldn't I be getting read or write
> > errors?
> >
> > I am only getting ZFS checksum errors - no read or write errors. I also
> > don't see any errors in the kernel for the disks.
> >
> > Can you paste the kind of errors you have found with the WD disks?
>
> Sure, I'll send it this evening when I get a chance.
>
>
> >
> > Thanks
> >
> > ________________________________________
> > From: zfs-discuss [zfs-discuss-bounces at list.zfsonlinux.org] On Behalf
> Of Badi' Abdul-Wahid
> > Sent: 16 May 2016 04:27 PM
> >
> > This looks similar to something I see on my system.
> > I've got a couple of the WD FFSX drives (same as yours but 7200rpm).
> > Do you see any errors in your kernel logs regarding these disks?
> > For me, libata reports bad sectors and I can always reproduce these
> > errors when doing a dd test or a scrub.
> > After a few minutes, the link speed gets downgraded from 6 to 3 Gbps and
> > the issue subsides.
> > I've been able to "hide" the problem by setting libata.force=3.0Gbps in the
> > kernel parameters during boot.
> > Can you pull a couple out and connect them directly to the motherboard?
> > When I tested this, skipping the backplane board, the issue disappeared
> > as well.
> > At this point I'm in the process of replacing my WD Red drives.
> > My guess at this point is that the vibration is beyond what the disks
> > can handle, despite the manufacturer claims.
> >
> > FWIW, my pool is a mix of WD, HGST, and Seagate disks in 3 mirrors of
> > two disks.
> >
>
> --
> Badi' Abdul-Wahid
>
>
>
>
>
> --
>
>
> Badi' Abdul-Wahid
>



-- 

Badi' Abdul-Wahid