[zfs-discuss] Unlistable files

Alan Latteri alan at instinctualsoftware.com
Sat Apr 7 14:55:55 EDT 2018


I believe the issue is related to primarycache=all. Set primarycache=metadata and try the tests again.

> On Apr 6, 2018, at 12:19 PM, Vladimir Brik via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:
> 
>> Could you post an example of the "file not found" errors?  What generates
>> those messages?
> Those appear in the logs of gridftp (application we use for file
> transfers). They are just regular messages that gridftp server records
> when a client requests a file that doesn't exist. It is possible the
> clients request files that were never uploaded or were legitimately
> deleted, so I don't know to what degree this is related to zfs, if at all.
> 
> Thanks very much for everybody's comments. I'll create a github ticket
> for this.
> 
> 
> Vlad
> 
> 
> 
> On 04/06/2018 12:55 PM, Andreas Dilger wrote:
>> On Apr 6, 2018, at 10:47 AM, Vladimir Brik <vladimir.brik at icecube.wisc.edu> wrote:
>>> 
>>> A new symptom I noticed is that I am no longer able to access the
>>> un-listable files by path directly if I run "echo 3 >
>>> /proc/sys/vm/drop_caches".
>> 
>> This implies that the files are not actually written to the disk, but were
>> only in the dcache of the system and were removed when you dropped cache.
>> Something is fairly seriously broken if that is the case.
>> 
>>> I ran the strace command and I don't see getdents returning 0. The names
>>> of the missing files do not appear in the output of strace at all, so it
>>> looks like the kernel does not return them.
>> 
>> If they were only in dcache, but not actually on disk, then they wouldn't
>> appear in the directory listing, since the directory is always generated
>> by the filesystem, while name lookups may be resolved from cache.
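The distinction above (listing comes from the filesystem, lookups may come from cache) can be checked from userspace. This is only a minimal sketch, assuming you already have a list of names that ought to exist, for example taken from the source directory. The function name `classify_names` and its arguments are hypothetical and not part of any ZFS tooling.

```python
import os


def classify_names(directory, expected_names):
    """Split expected_names into listed, lookup-only, and absent.

    'listed'      names are returned by the directory listing.
    'lookup_only' names are missing from the listing but still resolve
                  by path (the symptom reported in this thread).
    'absent'      names fail both checks.
    """
    listed_now = set(os.listdir(directory))
    result = {"listed": [], "lookup_only": [], "absent": []}
    for name in expected_names:
        if name in listed_now:
            result["listed"].append(name)
        elif os.path.lexists(os.path.join(directory, name)):
            # Resolves by path even though the listing did not return it.
            result["lookup_only"].append(name)
        else:
            result["absent"].append(name)
    return result
```

On a healthy filesystem the `lookup_only` list should stay empty. Running the check before and after dropping caches would show whether those names move to `absent`.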
>> 
>>> I am not sure if the problem only happens with big directories. The
>>> machine this is happening on is a file server, and there seem to be more
>>> "file not found" errors than usual in the logs, but I can't tell if that
>>> is caused by the same issue, or if the clients were simply trying to
>>> open files that were never uploaded in the first place.
>> 
>> Could you post an example of the "file not found" errors?  What generates
>> those messages?  From your other email, it looks like the directories are
>> about 7000 files each?  I wouldn't say that is too large, but I think this
>> is less relevant in light of the above comments.
>> 
>>>> You could use "zdb" to dump the
>>>> directory to confirm the entry is there
>>> How do I do this?
>> 
>> Sorry, I don't know much about zdb, just that it _can_ do this kind of
>> debugging...  Maybe someone with zdb-fu could help?  That said, I'm not
>> sure this information is useful anymore.  It seems like the root problem
>> is that the files are not even written to disk, not that there is a problem
>> returning them from the directory.
>> 
>> My next suggestion (if you can reproduce this reliably with some test script)
>> is to use git bisect to build different zfs.ko modules and isolate the
>> problem to a specific patch, so it can be reverted or fixed.
>> 
>> It is definitely also time to file an issue on GitHub, and make it clear
>> this is a data loss/corruption issue.
>> 
>> Cheers, Andreas
>> 
>>> 
>>> 
>>> On 04/05/2018 04:58 PM, Andreas Dilger wrote:
>>>> On Apr 5, 2018, at 2:34 PM, Vladimir Brik <vladimir.brik at icecube.wisc.edu> wrote:
>>>>> 
>>>>> Hello.
>>>>> 
>>>>> I have run into a strange issue where files don't show up in directory
>>>>> listing but can be accessed by path directly. I wonder if somebody knows
>>>>> what might have caused this.
>>>>> 
>>>>> # find dst/a/foo
>>>>> dst/a/foo
>>>>> (as expected)
>>>>> 
>>>>> # find dst/a/ -name foo
>>>>> (no output)
>>>>> 
>>>>> # ls -l dst/a/foo
>>>>> -rw-r--r-- 1 xxx xxx 5991051 Feb 22 13:35 dst/a/foo
>>>>> (as expected)
>>>>> 
>>>>> # ls -l dst/a/ | grep foo
>>>>> (no output)
>>>>> 
>>>>> # cp dst/a/foo bar
>>>>> (this works; bar is created and can be listed)
>>>> 
>>>> There are a few potential issues that might cause this.  One is if
>>>> getdents() returns from the kernel with d_ino == 0, then "ls" and
>>>> other directory walking tools will skip the entry as "deleted" for
>>>> historical reasons.
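As a complement to strace, the d_ino == 0 case could also be checked by reading the raw directory records directly, since libc-level directory readers may filter such entries before a program sees them. This is a rough sketch, assuming Linux on x86_64 or aarch64 (the syscall numbers in the table cover only those two). The helper names `parse_dirents` and `raw_dirents` are hypothetical.

```python
import ctypes
import os
import platform
import struct

# getdents64 syscall numbers; only these two architectures are covered here.
SYS_GETDENTS64 = {"x86_64": 217, "aarch64": 61}

# linux_dirent64 header: d_ino (u64), d_off (s64), d_reclen (u16), d_type (u8)
_HEADER = struct.Struct("=QqHB")


def parse_dirents(buf):
    """Decode a buffer of linux_dirent64 records into (d_ino, name) pairs."""
    entries = []
    pos = 0
    while pos + _HEADER.size <= len(buf):
        d_ino, _d_off, d_reclen, _d_type = _HEADER.unpack_from(buf, pos)
        if d_reclen == 0:
            break  # malformed record; stop rather than loop forever
        raw_name = buf[pos + _HEADER.size:pos + d_reclen]
        name = raw_name.split(b"\0", 1)[0]  # name is NUL-terminated and padded
        entries.append((d_ino, os.fsdecode(name)))
        pos += d_reclen
    return entries


def raw_dirents(path, bufsize=32768):
    """Return every (d_ino, name) the kernel reports for path, unfiltered."""
    nr = SYS_GETDENTS64[platform.machine()]
    libc = ctypes.CDLL(None, use_errno=True)
    buf = ctypes.create_string_buffer(bufsize)
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    entries = []
    try:
        while True:
            n = libc.syscall(nr, fd, buf, bufsize)
            if n < 0:
                err = ctypes.get_errno()
                raise OSError(err, os.strerror(err))
            if n == 0:  # end of directory
                return entries
            entries.extend(parse_dirents(buf.raw[:n]))
    finally:
        os.close(fd)
```

Any pair returned by `raw_dirents("dst/a")` whose first element is 0 would be an entry that the kernel returned but that "ls" is likely to skip. A name that is absent from the result entirely was never returned by the kernel.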
>>>> 
>>>> It might also be that "ls" and ZFS directory iteration do not play well
>>>> together, skipping some entries in the directory (e.g. hash collisions,
>>>> or if telldir() and seekdir() do not work properly).  If your problem
>>>> only happens on large directories then this is a possibility.
>>>> 
>>>> Run your "ls -l dst/a/" under strace and/or ltrace to see if these
>>>> entries are being returned from the kernel, but not printed by "ls",
>>>> or if they are not being returned by the kernel at all.  Something like:
>>>> 
>>>>   strace -f -e trace=open,getdents,lstat -v -y ls -l dst/a/
>>>> 
>>>> The exact system calls for getdents() and lstat() may depend on your
>>>> kernel and userspace libraries.  Note that this will suppress all of
>>>> the other system calls, but makes the output more readable.
>>>> 
>>>> Another possibility is a bug in the ZFS ZAP processing code, which does
>>>> not iterate over the entries properly, and doesn't return the names to
>>>> userspace via getdents() at all.  You could use "zdb" to dump the
>>>> directory to confirm the entry is there (it pretty much *HAS* to be, if
>>>> the "dst/a/foo" lookup works).  At that point, running with tracepoints,
>>>> or adding printk() debug messages and rebuilding the zfs.ko module would
>>>> help debug where the problem is.
>>>> 
>>>> Cheers, Andreas
>>>> 
>>>>> The problem occurs when I run something like "cp -r src dst", where src
>>>>> is a directory with 12 sub-directories with 6999 files each, about 84K
>>>>> files total, 2.9TB. After the copy finishes, dst is missing several thousand
>>>>> files according to find. (Similar thing happened when I tarred src and
>>>>> then unpacked it in a different location; according to tar --list the
>>>>> tarball contained all files.)
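One way to narrow down which sub-directories lost entries after such a copy is to compare per-directory entry counts between the two trees. This is only a sketch. The function names `entry_counts` and `count_mismatches` are hypothetical, and the two arguments are assumed to be the roots of corresponding trees (the source and its copy).

```python
import os


def entry_counts(root):
    """Map each directory under root (path relative to root) to its entry count."""
    counts = {}
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        counts[rel] = len(dirnames) + len(filenames)
    return counts


def count_mismatches(src, dst):
    """Return {relative_dir: (src_count, dst_count)} where the counts differ."""
    src_counts = entry_counts(src)
    dst_counts = entry_counts(dst)
    mismatches = {}
    for rel, n_src in sorted(src_counts.items()):
        n_dst = dst_counts.get(rel)  # None if the directory itself is missing
        if n_dst != n_src:
            mismatches[rel] = (n_src, n_dst)
    return mismatches
```

Because this relies on directory listings only, it counts what the listing returns, not what may still resolve by path. A directory reported here is a candidate for a per-name lookup check.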
>>>>> 
>>>>> The cp command reported "No space left on device" for a couple of files.
>>>>> The filesystem has about 80TB free (zpool is about 50% full). The files
>>>>> for which "No space left on device" error was generated just weren't
>>>>> created, it seems, but other missing files are accessible by their full
>>>>> path but did not show up in directory listings (as shown above).
>>>>> 
>>>>> ls is reporting some sub-directories of dst have 7000 hard links instead
>>>>> of 7001 that the sub-directories in src have. All missing files seem to
>>>>> be from such sub-directories.
>>>>> 
>>>>> After rebooting the server, the missing files were no longer accessible by
>>>>> full path.
>>>>> 
>>>>> It seems the problem is reproducible. Missing files are not always the same.
>>>>> 
>>>>> I am running ZFS 0.7.7, Scientific Linux release 6.8. No ZFS snapshots.
>>>>> 
>>>>> If anybody can shed light on this, I would really appreciate it :)
>>>>> 
>>>>> 
>>>>> Vlad
>>>>> _______________________________________________
>>>>> zfs-discuss mailing list
>>>>> zfs-discuss at list.zfsonlinux.org
>>>>> http://list.zfsonlinux.org/cgi-bin/mailman/listinfo/zfs-discuss