[zfs-discuss] Unlistable files

Vladimir Brik vladimir.brik at icecube.wisc.edu
Fri Apr 6 15:19:21 EDT 2018


> Could you post an example of the "file not found" errors?  What generates
> those messages?
Those appear in the logs of gridftp (the application we use for file
transfers). They are just the regular messages that the gridftp server
records when a client requests a file that doesn't exist. It is possible
that the clients requested files that were never uploaded or were
legitimately deleted, so I don't know to what degree this is related to
zfs, if at all.

Thanks very much for everybody's comments. I'll create a github ticket
for this.


Vlad



On 04/06/2018 12:55 PM, Andreas Dilger wrote:
> On Apr 6, 2018, at 10:47 AM, Vladimir Brik <vladimir.brik at icecube.wisc.edu> wrote:
>>
>> A new symptom I noticed is that I am no longer able to access the
>> un-listable files by path directly if I run "echo 3 >
>> /proc/sys/vm/drop_caches".
> 
> This implies that the files are not actually written to the disk, but were
> only in the dcache of the system and were removed when you dropped cache.
> Something is fairly seriously broken if that is the case.
> 
>> I ran the strace command and I don't see getdents returning 0. The names
>> of the missing files do not appear in the output of strace at all, so it
>> looks like the kernel does not return them.
> 
> If they were only in dcache, but not actually on disk, then they wouldn't
> appear in the directory listing, since the directory is always generated
> by the filesystem, while name lookups may be resolved from cache.
> 
>> I am not sure if the problem only happens with big directories. The
>> machine this is happening on is a file server, and there seem to be more
>> "file not found" errors than usual in the logs, but I can't tell if that
>> is caused by the same issue, or if the clients were simply trying to
>> open files that were never uploaded in the first place.
> 
> Could you post an example of the "file not found" errors?  What generates
> those messages?  From your other email, it looks like the directories are
> about 7000 files each?  I wouldn't say that is too large, but I think this
> is less relevant in light of the above comments.
> 
>>> You could use "zdb" to dump the
>>> directory to confirm the entry is there
>> How do I do this?
> 
> Sorry, I don't know much about zdb, just that it _can_ do this kind of
> debugging...  Maybe someone with zdb-fu could help?  That said, I'm not
> sure this information is useful anymore.  It seems like the root problem
> is that the files are not even written to disk, not that there is a problem
> returning them from the directory.
> 
> My next suggestion (if you can reproduce this reliably with some test script)
> is to use git bisect to build different zfs.ko modules and isolate the
> problem to a specific patch, so it can be reverted or fixed.
> 
> Definitely it is also time to file an issue at Github, and make it clear
> this is a data loss/corruption issue.
> 
> Cheers, Andreas
> 
>>
>>
>> On 04/05/2018 04:58 PM, Andreas Dilger wrote:
>>> On Apr 5, 2018, at 2:34 PM, Vladimir Brik <vladimir.brik at icecube.wisc.edu> wrote:
>>>>
>>>> Hello.
>>>>
>>>> I have run into a strange issue where files don't show up in directory
>>>> listing but can be accessed by path directly. I wonder if somebody knows
>>>> what might have caused this.
>>>>
>>>> # find dst/a/foo
>>>> dst/a/foo
>>>> (as expected)
>>>>
>>>> # find dst/a/ -name foo
>>>> (no output)
>>>>
>>>> # ls -l dst/a/foo
>>>> -rw-r--r-- 1 xxx xxx 5991051 Feb 22 13:35 dst/a/foo
>>>> (as expected)
>>>>
>>>> # ls -l dst/a/ | grep foo
>>>> (no output)
>>>>
>>>> # cp dst/a/foo bar
>>>> (this works; bar is created and can be listed)
>>>
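
One way to enumerate the affected files, offered as a rough sketch rather
than a vetted tool: compare the names that a walk of src returns against
the names that a walk of dst returns, then probe each absent name by
direct path. The function name find_unlistable and its output labels are
made up for illustration. It assumes that src itself lists correctly and
that no file name contains a newline.

```shell
#!/bin/sh
# find_unlistable SRC DST   (hypothetical helper, not part of any ZFS tooling)
# Prints "unlistable: PATH" for files that exist under SRC, are absent from
# a directory walk of DST, yet still resolve when looked up by path in DST.
# Prints "missing: PATH" when the direct lookup fails as well.
find_unlistable() {
    src=$1
    dst=$2
    tmp=$(mktemp -d) || return 1

    # Names obtained by walking each tree (the readdir/getdents path).
    (cd "$src" && find . -type f | LC_ALL=C sort) > "$tmp/expected"
    (cd "$dst" && find . -type f | LC_ALL=C sort) > "$tmp/listed"

    # comm -23 keeps the lines that appear only in the first file.
    LC_ALL=C comm -23 "$tmp/expected" "$tmp/listed" |
    while IFS= read -r rel; do
        # Direct lookup by path, which does not depend on the listing.
        if [ -e "$dst/$rel" ]; then
            printf 'unlistable: %s\n' "$rel"
        else
            printf 'missing: %s\n' "$rel"
        fi
    done

    rm -rf "$tmp"
}
```

It would be called as "find_unlistable src dst". Running it again after
dropping caches or rebooting should show whether "unlistable" entries
turn into "missing" ones.
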
>>> There are a few potential issues that might cause this.  One is if
>>> getdents() returns from the kernel with d_ino == 0, then "ls" and
>>> other directory walking tools will skip the entry as "deleted" for
>>> historical reasons.
>>>
>>> It might also be that "ls" and ZFS directory iteration do not play well
>>> together, skipping some entries in the directory (e.g. hash collisions,
>>> or if telldir() and seekdir() do not work properly).  If your problem
>>> only happens on large directories then this is a possibility.
>>>
>>> Run your "ls -l dst/a/" under strace and/or ltrace to see if these
>>> entries are being returned from the kernel, but not printed by "ls",
>>> or if they are not being returned by the kernel at all.  Something like:
>>>
>>>    strace -f -e trace=open,getdents,lstat -v -y ls -l dst/a/
>>>
>>> The exact system calls for getdents() and lstat() may depend on your
>>> kernel and userspace libraries.  Note that this will suppress all of
>>> the other system calls, but makes the output more readable.
>>>
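
To check a saved trace for one particular name, something like the
following may help. It is only a sketch: name_in_getdents is a made-up
helper, and it assumes the trace was captured with -v (and -o to write it
to a file) so that each directory entry shows up as d_name="...". The
exact output format can differ between strace versions, so the pattern
may need adjusting.

```shell
# name_in_getdents TRACE_FILE NAME   (hypothetical helper)
# Succeeds if NAME appears as a d_name in any getdents/getdents64 line of
# a trace captured with, for example:
#   strace -f -v -o ls.trace -e trace=getdents,getdents64 ls -l dst/a/
# (drop whichever syscall name your strace does not recognize).
# Assumption: entries are printed as d_name="NAME", which needs -v.
name_in_getdents() {
    grep -E 'getdents(64)?\(' "$1" | grep -Fq -- "d_name=\"$2\""
}
```

For example, "name_in_getdents ls.trace foo && echo returned" prints
"returned" only if the kernel handed the name foo back to ls.
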
>>> Another possibility is a bug in the ZFS ZAP processing code, which does
>>> not iterate over the entries properly, and doesn't return the names to
>>> userspace via getdents() at all.  You could use "zdb" to dump the
>>> directory to confirm the entry is there (it pretty much *HAS* to be, if
>>> the "dst/a/foo" lookup works).  At that point, running with tracepoints,
>>> or adding printk() debug messages and rebuilding the zfs.ko module would
>>> help debug where the problem is.
>>>
>>> Cheers, Andreas
>>>
>>>> The problem occurs when I run something like "cp -r src dst", where src
>>>> is a directory with 12 sub-directories with 6999 files each, about 84K
>>>> files total, 2.9TB. After the copy finishes, dst is missing several thousand
>>>> files according to find. (Similar thing happened when I tarred src and
>>>> then unpacked it in a different location; according to tar --list the
>>>> tarball contained all files.)
>>>>
>>>> The cp command reported "No space left on device" for a couple of files.
>>>> The filesystem has about 80TB free (zpool is about 50% full). The files
>>>> for which "No space left on device" error was generated just weren't
>>>> created, it seems, but other missing files are accessible by their full
>>>> path but did not show up in directory listings (as shown above).
>>>>
>>>> ls reports that some sub-directories of dst have 7000 hard links instead
>>>> of the 7001 that the sub-directories in src have. All missing files seem
>>>> to be from such sub-directories.
>>>>
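
To see which sub-directories are affected, the link count can be printed
next to the number of entries a listing actually returns, and the output
for src and dst compared. A sketch only: dir_counts is a made-up name,
"stat -c %h" assumes GNU coreutils, hidden sub-directories are skipped by
the glob, and names containing newlines would be miscounted.

```shell
# dir_counts PARENT   (hypothetical helper)
# For each immediate sub-directory of PARENT, print:
#   <hard link count> <entries returned by a listing> <path>
dir_counts() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue
        links=$(stat -c %h "$d")
        # ls -A lists every entry except "." and "..".
        listed=$(ls -A "$d" | wc -l | tr -d ' ')
        printf '%s %s %s\n' "$links" "$listed" "${d%/}"
    done
}
```

Running "dir_counts src" and "dir_counts dst" side by side should make
any sub-directory whose counts differ between the two trees stand out.
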
>>>> After rebooting the server, the missing files were no longer accessible by
>>>> full path.
>>>>
>>>> It seems the problem is reproducible. Missing files are not always the same.
>>>>
>>>> I am running ZFS 0.7.7, Scientific Linux release 6.8. No ZFS snapshots.
>>>>
>>>> If anybody can shed light on this, I would really appreciate it :)
>>>>
>>>>
>>>> Vlad
>>>> _______________________________________________
>>>> zfs-discuss mailing list
>>>> zfs-discuss at list.zfsonlinux.org
>>>> http://list.zfsonlinux.org/cgi-bin/mailman/listinfo/zfs-discuss

