[zfs-discuss] zfs large pool - many checksum errors, no read or write errors

Francois Stark francois at postmasters.co.za
Fri May 13 10:59:25 EDT 2016


Over the past five years we have built four large ZFS on Linux servers, all on Ubuntu 12 and 14.04, using Supermicro storage servers. The pools range from 2x 8-disk raidz1 to the latest monster: 3x 10-disk raidz2, filled with WD NAS RED 6TB disks. We have never seen anything like this on ZFS before.

So what do you make of our latest scrub output:

pool: saturnpool
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-9P
scan: scrub repaired 12,5M in 27h32m with 0 errors on Sun May 1 04:32:43 2016
config:

NAME STATE READ WRITE CKSUM
saturnpool ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-WDC_WD60EFRX-68MYMN1_WD-WX71D65JE3HN ONLINE 0 0 20
ata-WDC_WD60EFRX-68MYMN1_WD-WX71D65JEXK4 ONLINE 0 0 24
ata-WDC_WD60EFRX-68MYMN1_WD-WX71D65JEYP6 ONLINE 0 0 27
ata-WDC_WD60EFRX-68MYMN1_WD-WX81D6550AYV ONLINE 0 0 25
ata-WDC_WD60EFRX-68MYMN1_WD-WX81D6550CRK ONLINE 0 0 15
ata-WDC_WD60EFRX-68MYMN1_WD-WX81D6550R7L ONLINE 0 0 27
ata-WDC_WD60EFRX-68MYMN1_WD-WX91D65359U2 ONLINE 0 0 27
ata-WDC_WD60EFRX-68MYMN1_WD-WX91D6535C70 ONLINE 0 0 18
ata-WDC_WD60EFRX-68MYMN1_WD-WX91D6535RFL ONLINE 0 0 32
ata-WDC_WD60EFRX-68MYMN1_WD-WX91D6535SJJ ONLINE 0 0 31
raidz2-1 ONLINE 0 0 0
ata-WDC_WD60EFRX-68MYMN1_WD-WX91D6535Y18 ONLINE 0 0 28
ata-WDC_WD60EFRX-68MYMN1_WD-WX91D6535YK9 ONLINE 0 0 19
ata-WDC_WD60EFRX-68MYMN1_WD-WX91D65DCS2R ONLINE 0 0 23
ata-WDC_WD60EFRX-68MYMN1_WD-WXA1D6542060 ONLINE 0 0 37
ata-WDC_WD60EFRX-68MYMN1_WD-WXA1D65426T4 ONLINE 0 0 26
ata-WDC_WD60EFRX-68MYMN1_WD-WXA1D6542C71 ONLINE 0 0 26
ata-WDC_WD60EFRX-68MYMN1_WD-WXA1D6542CRZ ONLINE 0 0 21
ata-WDC_WD60EFRX-68MYMN1_WD-WXA1D6542DDT ONLINE 0 0 22
ata-WDC_WD60EFRX-68MYMN1_WD-WXA1D6542DY2 ONLINE 0 0 36
ata-WDC_WD60EFRX-68MYMN1_WD-WXA1D6542EJJ ONLINE 0 0 29
raidz2-2 ONLINE 0 0 0
ata-WDC_WD60EFRX-68MYMN1_WD-WXA1D6542EYN ONLINE 0 0 40
ata-WDC_WD60EFRX-68MYMN1_WD-WXA1D6542H5Y ONLINE 0 0 29
ata-WDC_WD60EFRX-68MYMN1_WD-WXA1D6542S04 ONLINE 0 0 24
ata-WDC_WD60EFRX-68MYMN1_WD-WXA1D6542SC0 ONLINE 0 0 30
ata-WDC_WD60EFRX-68MYMN1_WD-WXA1D6542ZNE ONLINE 0 0 26
ata-WDC_WD60EFRX-68MYMN1_WD-WXA1D65E34U9 ONLINE 0 0 26
ata-WDC_WD60EFRX-68MYMN1_WD-WXA1D65E3R25 ONLINE 0 0 24
ata-WDC_WD60EFRX-68MYMN1_WD-WXA1D65E3XL3 ONLINE 0 0 35
ata-WDC_WD60EFRX-68MYMN1_WD-WXA1D65E3Y62 ONLINE 0 0 24
ata-WDC_WD60EFRX-68MYMN1_WD-WXA1D65E3YJ9 ONLINE 0 0 28
cache
ata-Samsung_SSD_850_PRO_256GB_S251NXAH232543L ONLINE 0 0 0
ata-Samsung_SSD_850_PRO_256GB_S251NXAH232545F ONLINE 0 0 0
ata-Samsung_SSD_850_PRO_256GB_S251NXAH232550B ONLINE 0 0 0

errors: No known data errors



Hmm. A faulty LSI JBOD card? Normally we see read or write errors causing checksum errors, so how do we get so many checksum errors without any read/write errors?

Does this point to a software error? The errors are spread across all disks, and it is impossible that all disks are faulty...
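
To make the "errors across all disks" point easier to check, here is a minimal sketch that totals the CKSUM column per top-level vdev. It assumes the usual five-column device rows (NAME STATE READ WRITE CKSUM) and plain integer counts; the function name cksum_per_vdev is made up for illustration.

```shell
#!/bin/sh
# Sum the CKSUM column of `zpool status` output per top-level vdev.
# Assumption: device rows have exactly five fields and CKSUM is a plain
# integer (abbreviated counts such as "1.2K" are skipped by this sketch).
cksum_per_vdev() {
    awk '
        # A raidz or mirror line starts a new group of leaf devices.
        $1 ~ /^(raidz[0-9]*|mirror)-[0-9]+$/ { vdev = $1; next }
        # Auxiliary sections (cache, logs, spares) end the current group.
        $1 == "cache" || $1 == "logs" || $1 == "spares" { vdev = ""; next }
        # Leaf device rows: NAME STATE READ WRITE CKSUM
        vdev != "" && NF == 5 && $5 ~ /^[0-9]+$/ {
            sum[vdev] += $5
            n[vdev]++
        }
        END {
            for (v in sum)
                printf "%s disks=%d cksum_total=%d\n", v, n[v], sum[v]
        }
    ' | sort
}

# Example (needs a live pool):
#   zpool status saturnpool | cksum_per_vdev
```

Applied to the status output above, the per-vdev totals work out to 246, 267 and 286, so the errors are spread fairly evenly over all three raidz2 vdevs rather than concentrated on one group of disks.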

This server is meant to be a near-line backup server. We send 27 TB ZFS filesystems to it over 1-gigabit Ethernet and then do daily incremental zfs send | zfs receive to it.
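
For readers unfamiliar with that workflow, a daily incremental run looks roughly like the sketch below. It only prints the pipeline instead of executing it. The dataset names, snapshot names, host name, and the use of ssh as transport are all assumptions for illustration, not our actual setup.

```shell
#!/bin/sh
# Print (dry run) the pipeline for one incremental replication step.
# All arguments are hypothetical examples; ssh as transport is an assumption.
#   $1 source dataset      $2 previous snapshot name
#   $3 current snapshot    $4 backup host    $5 target dataset
incremental_send_cmd() {
    src=$1; prev=$2; cur=$3; host=$4; dst=$5
    # `zfs send -i` sends only the blocks changed between the two snapshots.
    printf 'zfs send -i %s@%s %s@%s | ssh %s zfs receive %s\n' \
        "$src" "$prev" "$src" "$cur" "$host" "$dst"
}

incremental_send_cmd tank/data 2016-05-12 2016-05-13 backuphost saturnpool/backup/data
```

The previous snapshot has to exist on both sides for the incremental stream to be accepted, which is why the first transfer of each filesystem is a full send.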

Has anybody seen this many checksum errors in ZFS before?

We have contacted Supermicro. They replaced the LSI card (Avago LSI SAS3008), but the errors persist. They are now blaming the disks (WD NAS RED 6TB).


Here is some more detail about the server:
Supermicro SSG-6048R-E1CR36L 4U rack, 2x E5-2600 v3, 36x 3.5”

We are running the latest ZFS version from the PPA repository zfs-native/stable

Ubuntu 14.04.4 LTS (GNU/Linux 4.2.0-36-generic x86_64)

03:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)

NAME        PROPERTY                    VALUE                       SOURCE
saturnpool  size                        164T                        -
saturnpool  capacity                    20%                         -
saturnpool  altroot                     -                           default
saturnpool  health                      ONLINE                      -
saturnpool  guid                        16758994863278285701        default
saturnpool  version                     -                           default
saturnpool  bootfs                      -                           default
saturnpool  delegation                  on                          default
saturnpool  autoreplace                 off                         default
saturnpool  cachefile                   -                           default
saturnpool  failmode                    wait                        default
saturnpool  listsnapshots               on                          local
saturnpool  autoexpand                  off                         default
saturnpool  dedupditto                  0                           default
saturnpool  dedupratio                  1.00x                       -
saturnpool  free                        130T                        -
saturnpool  allocated                   33.7T                       -
saturnpool  readonly                    off                         -
saturnpool  ashift                      0                           default
saturnpool  comment                     -                           default
saturnpool  expandsize                  -                           -
saturnpool  freeing                     0                           default
saturnpool  fragmentation               13%                         -
saturnpool  leaked                      0                           default
saturnpool  feature@async_destroy       enabled                     local
saturnpool  feature@empty_bpobj         active                      local
saturnpool  feature@lz4_compress        active                      local
saturnpool  feature@spacemap_histogram  active                      local
saturnpool  feature@enabled_txg         active                      local
saturnpool  feature@hole_birth          active                      local
saturnpool  feature@extensible_dataset  enabled                     local
saturnpool  feature@embedded_data       active                      local
saturnpool  feature@bookmarks           enabled                     local
saturnpool  feature@filesystem_limits   enabled                     local
saturnpool  feature@large_blocks        enabled                     local


zfs get version saturnpool
NAME        PROPERTY  VALUE    SOURCE
saturnpool  version   5        -


Thanks


