[zfs-discuss] Unlistable files

Andreas Dilger adilger at dilger.ca
Fri Apr 6 13:55:17 EDT 2018


On Apr 6, 2018, at 10:47 AM, Vladimir Brik <vladimir.brik at icecube.wisc.edu> wrote:
> 
> A new symptom I noticed is that I am no longer able to access the
> un-listable files by path directly if I run "echo 3 >
> /proc/sys/vm/drop_caches".

This implies that the files are not actually written to the disk, but were
only in the dcache of the system and were removed when you dropped cache.
Something is fairly seriously broken if that is the case.

> I ran the strace command and I don't see getdents returning 0. The names
> of the missing files do not appear in the output of strace at all, so it
> looks like the kernel does not return them.

If they were only in dcache, but not actually on disk, then they wouldn't
appear in the directory listing, since the directory is always generated
by the filesystem, while name lookups may be resolved from cache.
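
To make that distinction concrete, here is a minimal sketch that sorts a
list of expected names into "returned by the directory listing", "only
reachable by direct lookup", and "not there at all". The function name
classify_names and the idea of taking the expected names from the source
directory are assumptions for illustration, not something already used in
this thread.

```python
import os

def classify_names(directory, expected_names):
    """Compare what the directory listing returns with per-name lookups."""
    # os.listdir() is filled from the directory contents the filesystem returns
    listed = set(os.listdir(directory))
    report = {"listed": [], "lookup_only": [], "missing": []}
    for name in expected_names:
        if name in listed:
            report["listed"].append(name)
        elif os.path.lexists(os.path.join(directory, name)):
            # lstat() on the full path succeeds, but the listing omits the name
            report["lookup_only"].append(name)
        else:
            report["missing"].append(name)
    return report
```

Something like classify_names("dst/a", os.listdir("src/a")) run before and
after dropping caches would show whether the "lookup_only" names move to
"missing", which is what the dcache explanation predicts.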

> I am not sure if the problem only happens with big directories. The
> machine this is happening on is a file server, and there seem to be more
> "file not found" errors than usual in the logs, but I can't tell if that
> is caused by the same issue, or if the clients were simply trying to
> open files that were never uploaded in the first place.

Could you post an example of the "file not found" errors?  What generates
those messages?  From your other email, it looks like the directories are
about 7000 files each?  I wouldn't say that is too large, but I think this
is less relevant in light of the above comments.

>> You could use "zdb" to dump the
>> directory to confirm the entry is there
> How do I do this?

Sorry, I don't know much about zdb, just that it _can_ do this kind of
debugging...  Maybe someone with zdb-fu could help?  That said, I'm not
sure this information is useful anymore.  It seems like the root problem
is that the files are not even written to disk, not that there is a problem
returning them from the directory.

My next suggestion (if you can reproduce this reliably with some test script)
is to use git bisect to build different zfs.ko modules and isolate the
problem to a specific patch, so it can be reverted or fixed.
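
A test script for this could be as simple as the sketch below, which
compares the files a directory walk finds under the source and under the
copy. The function names are made up for illustration, and the steps of
rebuilding and reloading zfs.ko and re-running the copy for each bisect
step are assumed to live in a wrapper that is not shown here.

```python
import os

def relative_files(root):
    """Set of file paths under root, relative to root, as seen by a directory walk."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found

def missing_from_copy(src, dst):
    """Files listed under src that a directory walk of dst does not return."""
    return relative_files(src) - relative_files(dst)

def bisect_exit_code(src, dst):
    """Return 0 if every file in src is listed in dst, 1 otherwise."""
    missing = missing_from_copy(src, dst)
    for path in sorted(missing):
        print("missing from listing:", path)
    return 1 if missing else 0
```

Passing the result to sys.exit() gives "git bisect run" what it expects:
exit code 0 marks the revision good and 1 marks it bad. Because os.walk()
relies on the directory listing, files that are only reachable by path are
counted as missing, which matches the symptom reported here.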

It is definitely also time to file an issue on GitHub, and to make it clear
that this is a data loss/corruption issue.

Cheers, Andreas

> 
> 
> On 04/05/2018 04:58 PM, Andreas Dilger wrote:
>> On Apr 5, 2018, at 2:34 PM, Vladimir Brik <vladimir.brik at icecube.wisc.edu> wrote:
>>> 
>>> Hello.
>>> 
>>> I have run into a strange issue where files don't show up in directory
>>> listing but can be accessed by path directly. I wonder if somebody knows
>>> what might have caused this.
>>> 
>>> # find dst/a/foo
>>> dst/a/foo
>>> (as expected)
>>> 
>>> # find dst/a/ -name foo
>>> (no output)
>>> 
>>> # ls -l dst/a/foo
>>> -rw-r--r-- 1 xxx xxx 5991051 Feb 22 13:35 dst/a/foo
>>> (as expected)
>>> 
>>> # ls -l dst/a/ | grep foo
>>> (no output)
>>> 
>>> # cp dst/a/foo bar
>>> (this works; bar is created and can be listed)
>> 
>> There are a few potential issues that might cause this.  One is if
>> getdents() returns from the kernel with d_ino == 0, then "ls" and
>> other directory walking tools will skip the entry as "deleted" for
>> historical reasons.
>> 
>> It might also be that "ls" and ZFS directory iteration do not play well
>> together, skipping some entries in the directory (e.g. hash collisions,
>> or if telldir() and seekdir() do not work properly).  If your problem
>> only happens on large directories then this is a possibility.
>> 
>> Run your "ls -l dst/a/" under strace and/or ltrace to see if these
>> entries are being returned from the kernel, but not printed by "ls",
>> or if they are not being returned by the kernel at all.  Something like:
>> 
>>    strace -f -e trace=open,getdents,lstat -v -y ls -l dst/a/
>> 
>> The exact system calls for getdents() and lstat() may depend on your
>> kernel and userspace libraries.  Note that this will suppress all of
>> the other system calls, but makes the output more readable.
>> 
>> Another possibility is a bug in the ZFS ZAP processing code, which does
>> not iterate over the entries properly, and doesn't return the names to
>> userspace via getdents() at all.  You could use "zdb" to dump the
>> directory to confirm the entry is there (it pretty much *HAS* to be, if
>> the "dst/a/foo" lookup works).  At that point, running with tracepoints,
>> or adding printk() debug messages and rebuilding the zfs.ko module would
>> help debug where the problem is.
>> 
>> Cheers, Andreas
>> 
>>> The problem occurs when I run something like "cp -r src dst", where src
>>> is a directory with 12 sub-directories with 6999 files each, about 84K
>>> files total, 2.9TB. After copy finishes, dst is missing several thousand
>>> files according to find. (Similar thing happened when I tarred src and
>>> then unpacked it in a different location; according to tar --list the
>>> tarball contained all files.)
>>> 
>>> The cp command reported "No space left on device" for a couple of files.
>>> The filesystem has about 80TB free (zpool is about 50% full). The files
>>> for which "No space left on device" error was generated just weren't
>>> created, it seems, but other missing files are accessible by their full
>>> path but did not show up in directory listings (as shown above).
>>> 
>>> ls is reporting some sub-directories of dst have 7000 hard links instead
>>> of 7001 that the sub-directories in src have. All missing files seem to
>>> be from such sub-directories.
>>> 
>>> After rebooting the server, the missing files were no longer accessible by
>>> full path.
>>> 
>>> It seems the problem is reproducible. Missing files are not always the same.
>>> 
>>> I am running ZFS 0.7.7, Scientific Linux release 6.8. No ZFS snapshots.
>>> 
>>> If anybody can shed light on this, I would really appreciate it :)
>>> 
>>> 
>>> Vlad