[zfs-discuss] Need help fixing... ???

Durval Menezes durval.menezes at gmail.com
Thu Oct 31 05:05:16 EDT 2013


Hi ChessDoter,

On Thu, Oct 31, 2013 at 5:48 AM, UbuntuNewbie <chessdoter at googlemail.com> wrote:

> Hello list,
>
> It was purely for ZoL that I changed my main OS (and hardware) a couple
> of years ago, and, even though I am only a simple user, I am very happy
> with that decision!
>
> Only recently I ran into several (mostly minor) problems; the most severe
> one I want to lay out here:
>
> Background:
> On my Ubuntu LTS system, running from a small SSD partition, I regularly
> run an rsync (from /) into a raidz pool on spinning rust. There, I take
> snapshots and also run backups (send/recv).
>
> Problem:
> A couple of days ago, one rsync run produced a write error (writing into
> the pool) and left things in a state that prevents further work with this
> setup. (To be fair: I don't know whether rsync was the culprit, or maybe
> zfs. Anyway, the damage done needs to be fixed...)
>

From what I see below, you have some kind of logical inconsistency in
filesystem metadata. I don't think there's ANY way rsync could be the
culprit (unless something really braindead happened, like you giving it the
/dev raw device as an argument to work with -- which I don't think you
would do). It's almost certainly hardware (most probable: bad memory, bad
controller, bad cabling, bad power supply, etc)  or something in the kernel
(buggy controller driver, etc), or perhaps ZFS (least probable of all).

What can you tell us regarding your hardware setup?



> Observations/Procedures:
> TWO files couldn't be written; at least their directory entry/inode seems
> to be corrupted. As in
>
> $ ll -a
> ls: cannot access mach-gt64120.h: No such file or directory
> ls: cannot access mc146818rtc.h: No such file or directory
> total 22
> drwxr-xr-x  2 root root  4 Oct 29 09:47 ./
> drwxr-xr-x 26 root root 32 Oct 22 04:15 ../
> -?????????  ? ?    ?     ?            ? mach-gt64120.h
> -?????????  ? ?    ?     ?            ? mc146818rtc.h
>

That looks very much like metadata corruption to me. I've seen very similar
things in other filesystems caused by hardware problems, but never in ZFS
(doesn't ZFS store 2 copies of all metadata precisely to make something
like that almost impossible?).



> As deleting wasn't possible, see:
>
> $ cd ..;sudo rm -Rv mach-malta-old/
> rm: cannot remove 'mach-malta-old/mach-gt64120.h': No such file or directory
> rm: cannot remove 'mach-malta-old/mc146818rtc.h': No such file or directory
>
> I chose to move the directory to another place inside this filesystem.
>
> But how to get rid of it?
> I came up with zfs send > file;zfs recv < file
> but zfs recv aborted, stating that the file was corrupted and had
> checksum errors.
> Next, I did a scrub on the whole pool containing the filesystem, which
> ended without any errors or actions.
> That is where I am now.
>

I would back it up with "tar", explicitly excluding the directory that
contains the messed-up files.
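
A minimal sketch of such a backup, assuming GNU tar. The function name
backup_excluding and all paths in the usage line are placeholders I made up
for illustration; only the directory name mach-malta-old comes from the
listing above. Write the archive somewhere outside the damaged filesystem.

```shell
# backup_excluding SRC DEST SKIP
#   SRC   directory tree to archive (e.g. the mountpoint of the filesystem)
#   DEST  archive file to create; must live OUTSIDE of SRC
#   SKIP  name of the directory to leave out of the archive
backup_excluding() {
    src=$1
    dest=$2
    skip=$3
    # --exclude must come before the file arguments it applies to.
    # -p keeps permissions; -C makes member names relative to SRC.
    tar --exclude="$skip" -cpf "$dest" -C "$src" .
}

# Hypothetical usage (paths are assumptions, adjust to your layout):
#   backup_excluding /tank/osbackup /mnt/external/osbackup.tar mach-malta-old
#   tar -tf /mnt/external/osbackup.tar | grep mach-malta-old   # should print nothing
```

Listing the archive afterwards with "tar -tf" is a cheap way to confirm that
the damaged directory really was skipped and everything else made it in.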



>
> Questions:
> My plan to destroy the filesystem and replace it with an intact one
> failed. How should I proceed instead to fix the current state?
>

If I were in your shoes and I had time (i.e., the issue is not critical
enough to demand correction *now*, and I could afford to stop using the
filesystem for anything in the meantime), I would *not* destroy the FS
before trying to understand what happened, specifically with regard to the
*cause*: in my experience, simply ignoring this kind of issue almost
always guarantees that it will come back to bite you in the rear...



> My scripts (to create the OS backup & snapshots, and later back them up to
> external media) have been running for a long time. The latest change,
> though, was that I added the --inplace parameter to the rsync (as
> suggested in some other post on this list), and that worked fine for
> several weeks. What actions should I take to prevent this issue from
> coming up again?
>

I don't think that "--inplace" could be responsible in any way for that
issue.



> Any diagnosis suggested/recommended? (Please keep in mind, I am ONLY a user.)
>

Some suggestions:
0) DO A BACKUP (see my suggestion above on skipping the problem area) if
you didn't already. Treat that filesystem as something very delicate that
could break completely at any moment.
1) Do you log kernel messages? Look in the log file for anything strange
prior to the first time you noticed that error.
2) If you don't, what does "dmesg|tail -30" show when you run the "ls"
above and get those errors?
3) What does "smartctl -a" on the underlying disks show?
4) What does "zpool status" report after a "zpool scrub"?
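
To make items 2 to 4 easier to post back to the list, here is a rough
sketch that saves the output of those commands into one directory. The
function name collect_diag is made up, and the pool name and disk devices
in the usage line are placeholders, since neither has been mentioned in
this thread. It only calls "dmesg", "smartctl -a" and "zpool status -v"
(-v additionally lists files with known errors), and it skips tools that
are not installed.

```shell
# collect_diag OUTDIR POOL [DISK...]
# Saves the diagnostics from items 2-4 as text files under OUTDIR.
collect_diag() {
    out=$1
    pool=$2
    shift 2
    mkdir -p "$out" || return 1

    # Item 2: the last 30 kernel messages (may need root on some systems).
    # Run this right after reproducing the "ls" errors.
    dmesg 2>&1 | tail -30 > "$out/dmesg-tail.txt"

    # Item 3: full SMART report for each underlying disk.
    # smartctl uses non-zero exit codes as status flags, hence "|| true".
    if command -v smartctl >/dev/null 2>&1; then
        for disk in "$@"; do
            smartctl -a "$disk" > "$out/smart-$(basename "$disk").txt" 2>&1 || true
        done
    fi

    # Item 4: pool state. Start "zpool scrub POOL" beforehand and wait
    # until "zpool status" shows the scrub as finished before running this.
    if command -v zpool >/dev/null 2>&1; then
        zpool status -v "$pool" > "$out/zpool-status.txt" 2>&1 || true
    fi
    return 0
}

# Hypothetical usage, as root (pool and device names are assumptions):
#   collect_diag /tmp/zfs-diag tank /dev/sda /dev/sdb /dev/sdc
```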

Whatever happens, please keep us posted.

Cheers, and good luck,
-- 
   Durval.



>
> Thanks in advance. I will write a separate post for the other
> questions/problems.
> cheers, U.N.
>
>  To unsubscribe from this group and stop receiving emails from it, send an
> email to zfs-discuss+unsubscribe at zfsonlinux.org.
>
