[zfs-discuss] Unlistable files

Vladimir Brik vladimir.brik at icecube.wisc.edu
Mon Apr 9 10:38:35 EDT 2018


> I believe the issue is related to primarycache=all; set
> primarycache=metadata and try the tests again.
Same problem.

Vlad


On 04/07/2018 01:55 PM, Alan Latteri wrote:
> I believe the issue is related to primarycache=all; set primarycache=metadata and try the tests again.
> 
>> On Apr 6, 2018, at 12:19 PM, Vladimir Brik via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:
>>
>>> Could you post an example of the "file not found" errors?  What generates
>>> those messages?
>> Those appear in the logs of gridftp (the application we use for file
>> transfers). They are just regular messages that gridftp server records
>> when a client requests a file that doesn't exist. It is possible the
>> clients request files that were never uploaded or were legitimately
>> deleted, so I don't know to what degree this is related to zfs, if at all.
>>
>> Thanks very much for everybody's comments. I'll create a github ticket
>> for this.
>>
>>
>> Vlad
>>
>>
>>
>> On 04/06/2018 12:55 PM, Andreas Dilger wrote:
>>> On Apr 6, 2018, at 10:47 AM, Vladimir Brik <vladimir.brik at icecube.wisc.edu> wrote:
>>>>
>>>> A new symptom I noticed is that I am no longer able to access the
>>>> un-listable files by path directly if I run "echo 3 >
>>>> /proc/sys/vm/drop_caches".
>>>
>>> This implies that the files are not actually written to the disk, but were
>>> only in the dcache of the system and were removed when you dropped cache.
>>> Something is fairly seriously broken if that is the case.
>>>
>>>> I ran the strace command and I don't see getdents returning 0. The names
>>>> of the missing files do not appear in the output of strace at all, so it
>>>> looks like the kernel does not return them.
>>>
>>> If they were only in dcache, but not actually on disk, then they wouldn't
>>> appear in the directory listing, since the directory is always generated
>>> by the filesystem, while name lookups may be resolved from cache.
>>>
>>>> I am not sure if the problem only happens with big directories. The
>>>> machine this is happening on is a file server, and there seem to be more
>>>> "file not found" errors than usual in the logs, but I can't tell if that
>>>> is caused by the same issue, or if the clients were simply trying to
>>>> open files that were never uploaded in the first place.
>>>
>>> Could you post an example of the "file not found" errors?  What generates
>>> those messages?  From your other email, it looks like the directories are
>>> about 7000 files each?  I wouldn't say that is too large, but I think this
>>> is less relevant in light of the above comments.
>>>
>>>>> You could use "zdb" to dump the
>>>>> directory to confirm the entry is there
>>>> How do I do this?
>>>
>>> Sorry, I don't know much about zdb, just that it _can_ do this kind of
>>> debugging...  Maybe someone with zdb-fu could help?  That said, I'm not
>>> sure this information is useful anymore.  It seems like the root problem
>>> is that the files are not even written to disk, not that there is a problem
>>> returning them from the directory.
>>>
>>> My next suggestion (if you can reproduce this reliably with some test script)
>>> is to use git bisect to build different zfs.ko modules and isolate the
>>> problem to a specific patch, so it can be reverted or fixed.
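
A reproducer for such a bisect could look roughly like the Python sketch below. It copies a tree and then checks that every file name in the source also shows up in the destination's directory listing. The function names (`unlisted`, `bisect_exit_code`) are made up for illustration, and rebuilding and reloading zfs.ko at each bisect step is not covered here. The return value follows the `git bisect run` convention of 0 for a good revision and 1 for a bad one.

```python
import os
import shutil

def unlisted(src, dst):
    """Return relative paths of files in src whose names the dst listing lacks."""
    missing = []
    for root, _dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        dst_dir = os.path.join(dst, rel)
        # Names as returned by directory iteration, not by path lookup.
        listed = set(os.listdir(dst_dir)) if os.path.isdir(dst_dir) else set()
        for name in files:
            if name not in listed:
                missing.append(os.path.normpath(os.path.join(rel, name)))
    return sorted(missing)

def bisect_exit_code(src, dst):
    """Copy src to dst (dst must not exist yet); return 0 if good, 1 if bad."""
    try:
        shutil.copytree(src, dst)
    except shutil.Error:
        # Per-file copy failures (for example ENOSPC) are collected by
        # copytree; the affected files are then reported by unlisted().
        pass
    return 1 if unlisted(src, dst) else 0
```

A small wrapper would pass the result to `sys.exit()` so that `git bisect run` can read it as the exit status. How much data the copy needs before the problem triggers is not known from this thread.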
>>>
>>> Definitely it is also time to file an issue at Github, and make it clear
>>> this is a data loss/corruption issue.
>>>
>>> Cheers, Andreas
>>>
>>>>
>>>>
>>>> On 04/05/2018 04:58 PM, Andreas Dilger wrote:
>>>>> On Apr 5, 2018, at 2:34 PM, Vladimir Brik <vladimir.brik at icecube.wisc.edu> wrote:
>>>>>>
>>>>>> Hello.
>>>>>>
>>>>>> I have run into a strange issue where files don't show up in directory
>>>>>> listing but can be accessed by path directly. I wonder if somebody knows
>>>>>> what might have caused this.
>>>>>>
>>>>>> # find dst/a/foo
>>>>>> dst/a/foo
>>>>>> (as expected)
>>>>>>
>>>>>> # find dst/a/ -name foo
>>>>>> (no output)
>>>>>>
>>>>>> # ls -l dst/a/foo
>>>>>> -rw-r--r-- 1 xxx xxx 5991051 Feb 22 13:35 dst/a/foo
>>>>>> (as expected)
>>>>>>
>>>>>> # ls -l dst/a/ | grep foo
>>>>>> (no output)
>>>>>>
>>>>>> # cp dst/a/foo bar
>>>>>> (this works; bar is created and can be listed)
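
The mismatch shown above (a name resolves by path but is absent from the listing) can be checked for many names at once. This is a minimal Python sketch, assuming the expected names are known, for example from the source directory. `classify_names` and its category labels are hypothetical, not part of any ZFS tooling.

```python
import os

def classify_names(directory, expected_names):
    """Compare what the directory listing returns with what path lookup finds."""
    listed = set(os.listdir(directory))  # names returned by directory iteration
    result = {"ok": [], "unlistable": [], "listed_only": [], "missing": []}
    for name in sorted(expected_names):
        in_listing = name in listed
        try:
            os.lstat(os.path.join(directory, name))  # direct lookup by path
            by_path = True
        except FileNotFoundError:
            by_path = False
        if in_listing and by_path:
            result["ok"].append(name)
        elif by_path:
            result["unlistable"].append(name)   # the symptom described here
        elif in_listing:
            result["listed_only"].append(name)
        else:
            result["missing"].append(name)
    return result
```

Names that end up under "unlistable" behave like dst/a/foo above; names under "missing" were not found by either method.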
>>>>>
>>>>> There are a few potential issues that might cause this.  One is if
>>>>> getdents() returns from the kernel with d_ino == 0, then "ls" and
>>>>> other directory walking tools will skip the entry as "deleted" for
>>>>> historical reasons.
>>>>>
>>>>> It might also be that "ls" and ZFS directory iteration do not play well
>>>>> together, skipping some entries in the directory (e.g. hash collisions,
>>>>> or if telldir() and seekdir() do not work properly).  If your problem
>>>>> only happens on large directories then this is a possibility.
>>>>>
>>>>> Run your "ls -l dst/a/" under strace and/or ltrace to see if these
>>>>> entries are being returned from the kernel, but not printed by "ls",
>>>>> or if they are not being returned by the kernel at all.  Something like:
>>>>>
>>>>>   strace -f -e trace=open,getdents,lstat -v -y ls -l dst/a/
>>>>>
>>>>> The exact system calls for getdents() and lstat() may depend on your
>>>>> kernel and userspace libraries.  Note that this suppresses all of
>>>>> the other system calls, which makes the output more readable.
>>>>>
>>>>> Another possibility is a bug in the ZFS ZAP processing code, which does
>>>>> not iterate over the entries properly, and doesn't return the names to
>>>>> userspace via getdents() at all.  You could use "zdb" to dump the
>>>>> directory to confirm the entry is there (it pretty much *HAS* to be, if
>>>>> the "dst/a/foo" lookup works).  At that point, running with tracepoints,
>>>>> or adding printk() debug messages and rebuilding the zfs.ko module would
>>>>> help debug where the problem is.
>>>>>
>>>>> Cheers, Andreas
>>>>>
>>>>>> The problem occurs when I run something like "cp -r src dst", where src
>>>>>> is a directory with 12 sub-directories with 6999 files each, about 84K
>>>>>> files total, 2.9TB. After copy finishes, dst is missing several thousand
>>>>>> files according to find. (Similar thing happened when I tarred src and
>>>>>> then unpacked it in a different location; according to tar --list the
>>>>>> tarball contained all files.)
>>>>>>
>>>>>> The cp command reported "No space left on device" for a couple of files.
>>>>>> The filesystem has about 80TB free (zpool is about 50% full). The files
>>>>>> for which "No space left on device" error was generated just weren't
>>>>>> created, it seems, but other missing files are accessible by their full
>>>>>> path but did not show up in directory listings (as shown above).
>>>>>>
>>>>>> ls reports that some sub-directories of dst have 7000 hard links instead
>>>>>> of the 7001 that the sub-directories in src have. All missing files seem to
>>>>>> be from such sub-directories.
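
The link-count comparison described above can be automated across all sub-directories. This is a rough Python sketch with a hypothetical function name, `compare_subdirs`. It only reports the numbers; what a directory's link count means depends on the filesystem, and the 7001 above for 6999 files suggests it tracks the entry count here.

```python
import os

def compare_subdirs(src, dst):
    """Report sub-directories whose (link count, listed entries) differ."""
    report = {}
    for name in sorted(os.listdir(src)):
        s = os.path.join(src, name)
        d = os.path.join(dst, name)
        if not os.path.isdir(s):
            continue  # only sub-directories are compared
        if not os.path.isdir(d):
            report[name] = None  # sub-directory absent from dst
            continue
        s_stat = (os.stat(s).st_nlink, len(os.listdir(s)))
        d_stat = (os.stat(d).st_nlink, len(os.listdir(d)))
        if s_stat != d_stat:
            report[name] = {"src": s_stat, "dst": d_stat}
    return report
```

An empty result means every sub-directory of src has a counterpart in dst with the same link count and the same number of listed entries.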
>>>>>>
>>>>>> After rebooting the server, the missing files were no longer accessible by
>>>>>> full path.
>>>>>>
>>>>>> It seems the problem is reproducible. Missing files are not always the same.
>>>>>>
>>>>>> I am running ZFS 0.7.7, Scientific Linux release 6.8. No ZFS snapshots.
>>>>>>
>>>>>> If anybody can shed light on this, I would really appreciate it :)
>>>>>>
>>>>>>
>>>>>> Vlad
>>>>>> _______________________________________________
>>>>>> zfs-discuss mailing list
>>>>>> zfs-discuss at list.zfsonlinux.org
>>>>>> http://list.zfsonlinux.org/cgi-bin/mailman/listinfo/zfs-discuss

