[zfs-discuss] cannot import 'home': I/O error Destroy and re-create the pool from a backup source

Durval Menezes durval.menezes at gmail.com
Fri Apr 27 12:13:07 EDT 2018


Hello Anton, Bryn.

On Fri, Apr 27, 2018 at 12:54 PM, Bryn Hughes via zfs-discuss <
zfs-discuss at list.zfsonlinux.org> wrote:

> Glad you got it working!!
>

+1, and also thanks to Anton for posting a complete write-up of his
recovery attempts; I can see this being really useful to the next person
with this issue who comes here a-googling.


> Do we have any hints as to the cause of the initial corruption?
>

I'm also interested in this, but given the *large* number of CKSUM errors
reported, I'd bet Anton's machine has some data-corrupting hardware
issue(s)... like failing non-ECC RAM and/or a faulty PSU and/or a faulty
disk controller...

Cheers,
-- 
   Durval.

Bryn

On 2018-04-27 04:06 AM, Anton Gubar'kov via zfs-discuss wrote:

Hi, friends

My copying has completed. I could save each and every file from the video
archive dataset (the one I was most anxious to save).
I could not save any of my lab VM images on zvols :-(. They are only lab
machines anyway; I can rebuild them in time.

Here are the final stats on checksum errors:
  pool: home
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0 days 09:58:16 with 0 errors on Wed Mar 14
09:21:55 2018
config:

        NAME                            STATE     READ WRITE CKSUM
        home                            ONLINE       0     0 1.80K
          raidz2-0                      ONLINE       0     0 9.03K
            wwn-0x5000c500a41a0a00      ONLINE       0     0     0
            wwn-0x5000c500a41ae340      ONLINE       0     0     0
            wwn-0x5000c500a41b4c57      ONLINE       0     0 38.0K
            wwn-0x5000c500a41b7572      ONLINE       0     0     0
            wwn-0x5000c500a41ba99c      ONLINE       0     0     0
            wwn-0x5000c500a41babe8      ONLINE       0     0 37.5K
        logs
          wwn-0x30000d1700d9d40f-part2  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        home/VM/WIN10PRO-1@MSI-enable:<0x1>
        home/VM/WIN10PRO-1:<0x1>
        home/users/anrdey:<0x0>

The procedure I used to get to this point:

   1. zpool import -F -o readonly=on home - failure
   2. zpool import -FX -o readonly=on home - failure
   3. many attempts at zpool import -T <txg> -o readonly=on home - failure
   4. discovery of the broken ZIL and attempts to import -m with all of the
   above combinations - failure
   5. side suggestion from Richard to offline the cache device; I removed the
   device file from /dev - failure
   6. following another thread, I dared to build zfs from
   https://github.com/zfsonlinux/zfs/pull/7459 (the branch link) and set the
   zfs_dbgmsg_enable=1 module parameter
   7. following /proc/spl/kstat/zfs/dbgmsg showed that I had some txgs with
   very little metadata corruption (1-2 items), but no txg that was
   completely clean
   8. Chris suggested a way to ignore metadata corruption and try the pool
   import anyway: echo 0 > /sys/module/zfs/parameters/spa_load_verify_metadata
   9. I used zdb -d -e home to find the txg data for the snapshots I had in
   my pool. I made a list of txgs for the snapshots in my video dataset and
   started running import -T <txg> -m -o readonly=on -R <mountpoint>. Corrupted
   txgs resulted in zfs kernel thread oopses and the host had to be rebooted.
   The third txg I tried resulted in a successful import and mounting of the
   datasets. Bingo! (A condensed sketch of this command sequence follows the
   list.)
   10. I started copying the files off the datasets. I used rsync rather
   than zfs send/receive so I could see which files I could and couldn't
   salvage, and dd to copy the zvols to image files (see the second sketch
   below). I couldn't copy the zvols due to I/O errors, but I could copy all
   files from my video dataset.
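
For reference, a condensed sketch of the command sequence from steps 6-9
above; the <txg> value and the /mnt/recovery altroot are placeholders,
substitute your own values:

  # enable the internal ZFS debug log and watch it between import attempts
  echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
  cat /proc/spl/kstat/zfs/dbgmsg

  # skip metadata verification during pool load (Chris' suggestion)
  echo 0 > /sys/module/zfs/parameters/spa_load_verify_metadata

  # list the datasets/snapshots of the exported pool and note their txg numbers
  zdb -d -e home

  # read-only rewind import at a chosen txg, ignoring the broken log device (-m)
  zpool import -T <txg> -m -o readonly=on -R /mnt/recovery home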

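A similar sketch of the salvage copy in step 10; the exact rsync/dd options
and the source/destination paths (home/video, /backup) are illustrative
rather than a verbatim record (in this case the zvols still returned I/O
errors and could not be salvaged):

  # copy regular files off the read-only mounted datasets, preserving metadata
  rsync -aHAX --progress /mnt/recovery/home/video/ /backup/video/

  # attempt to image a zvol to a file; conv=noerror,sync skips read errors,
  # padding unreadable blocks with zeros
  dd if=/dev/zvol/home/VM/WIN10PRO-1 of=/backup/WIN10PRO-1.img \
     bs=1M conv=noerror,sync status=progress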

Thanks, everyone, for the helpful suggestions. I do hope this thread can
help others in despair.




On Thu, Apr 26, 2018 at 11:01 PM Anton Gubar'kov <anton.gubarkov at iits.ru>
wrote:

> Dear friends,
> I used zdb -d to display data about the snapshots I have in the pool's
> datasets. I checked the txg numbers of the snapshots in the dataset I'm
> most anxious to recover, and started my import attempts from the most
> recent, working back toward the past. The third one proved to work. I had
> to restart my host after every unsuccessful attempt due to a zfs freeze.
>
> The copy is still running and will be for at least 6 more hours.
> The current zpool status -v output looks like:
>   pool: home
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: http://zfsonlinux.org/msg/ZFS-8000-8A
>   scan: scrub repaired 0B in 0 days 09:58:16 with 0 errors on Wed Mar 14
> 09:21:55 2018
> config:
>
>         NAME                            STATE     READ WRITE CKSUM
>         home                            ONLINE       0     0   122
>           raidz2-0                      ONLINE       0     0   514
>             wwn-0x5000c500a41a0a00      ONLINE       0     0     0
>             wwn-0x5000c500a41ae340      ONLINE       0     0     0
>             wwn-0x5000c500a41b4c57      ONLINE       0     0     7
>             wwn-0x5000c500a41b7572      ONLINE       0     0     0
>             wwn-0x5000c500a41ba99c      ONLINE       0     0     0
>             wwn-0x5000c500a41babe8      ONLINE       0     0     8
>         logs
>           wwn-0x30000d1700d9d40f-part2  ONLINE       0     0     0
>
> errors: Permanent errors have been detected in the following files:
>
>         home/users/anrdey:<0x0>
>
> I don't really care about the home/users/anrdey dataset where permanent
> errors are reported, but I don't understand the error stats. What do
> checksum errors at the pool and raidz2-0 vdev levels mean? They keep
> growing while the device-level checksum errors stay the same.
> No read errors have been reported to the copying process so far (around 1TB
> of data has been copied already), and there have been no messages in the
> zfs debug log since the import completed.
>
> thanks
>
>
> On Thu, Apr 26, 2018 at 6:23 PM Raghuram Devarakonda via zfs-discuss <
> zfs-discuss at list.zfsonlinux.org> wrote:
>
>> On Thu, Apr 26, 2018 at 11:12 AM, Anton Gubar'kov via zfs-discuss
>> <zfs-discuss at list.zfsonlinux.org> wrote:
>> > Chris, thank you very much for the hint! After a couple of panics, I
>> > could find the intact txg and import the pool, rewinding it to one of
>> > the snapshots' txgs. I'm copying the contents now. I understand that I
>> > may not be able to copy everything, but this is better than losing
>> > everything.
>>
>> That's great. Can you please describe how you figured out the valid txg?
>> _______________________________________________
>> zfs-discuss mailing list
>> zfs-discuss at list.zfsonlinux.org
>> http://list.zfsonlinux.org/cgi-bin/mailman/listinfo/zfs-discuss
>>
>

_______________________________________________
zfs-discuss mailing list
zfs-discuss at list.zfsonlinux.org
http://list.zfsonlinux.org/cgi-bin/mailman/listinfo/zfs-discuss



> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at list.zfsonlinux.org
> http://list.zfsonlinux.org/cgi-bin/mailman/listinfo/zfs-discuss
>
>