[zfs-discuss] Unlistable files

Rich rincebrain at gmail.com
Wed Apr 11 09:25:19 EDT 2018


I think everyone on the thread is already aware, but just so anyone
reading the thread can link the two, this turned out to be the rather
nasty bug in https://github.com/zfsonlinux/zfs/issues/7401

- Rich

On Mon, Apr 9, 2018 at 10:38 AM, Vladimir Brik via zfs-discuss
<zfs-discuss at list.zfsonlinux.org> wrote:
>> I believe the issue is related to primarycache=all; set
>> primarycache=metadata and try the tests again.
> Same problem.
>
> Vlad
>
>
> On 04/07/2018 01:55 PM, Alan Latteri wrote:
>> I believe the issue is related to primarycache=all; set primarycache=metadata and try the tests again.
>>
>>> On Apr 6, 2018, at 12:19 PM, Vladimir Brik via zfs-discuss <zfs-discuss at list.zfsonlinux.org> wrote:
>>>
>>>> Could you post an example of the "file not found" errors?  What generates
>>>> those messages?
>>> Those appear in the logs of gridftp (the application we use for file
>>> transfers). They are just regular messages that gridftp server records
>>> when a client requests a file that doesn't exist. It is possible the
>>> clients request files that were never uploaded or were legitimately
>>> deleted, so I don't know to what degree this is related to zfs, if at all.
>>>
>>> Thanks very much for everybody's comments. I'll create a github ticket
>>> for this.
>>>
>>>
>>> Vlad
>>>
>>>
>>>
>>> On 04/06/2018 12:55 PM, Andreas Dilger wrote:
>>>> On Apr 6, 2018, at 10:47 AM, Vladimir Brik <vladimir.brik at icecube.wisc.edu> wrote:
>>>>>
>>>>> A new symptom I noticed is that I am no longer able to access the
>>>>> un-listable files by path directly after I run "echo 3 >
>>>>> /proc/sys/vm/drop_caches".
>>>>
>>>> This implies that the files are not actually written to the disk, but were
>>>> only in the dcache of the system and were removed when you dropped cache.
>>>> Something fairly seriously broken if that is the case.
>>>>
>>>>> I ran the strace command and I don't see getdents returning entries
>>>>> with d_ino == 0. The names of the missing files do not appear in the
>>>>> output of strace at all, so it looks like the kernel does not return them.
>>>>
>>>> If they were only in dcache, but not actually on disk, then they wouldn't
>>>> appear in the directory listing, since the directory is always generated
>>>> by the filesystem, while name lookups may be resolved from cache.
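Here is a toy Python model of the explanation above. "ToyFs" is a hypothetical class, not kernel or ZFS code. It assumes only two things: path lookups consult an in-memory name cache first, and listings are built from persisted state. Under those assumptions, an entry that never reached disk stays reachable by path until the cache is dropped, and it never appears in a listing:

```python
class ToyFs:
    """Toy model, not kernel or ZFS code: lookups hit the name cache
    (dcache) first, while readdir only sees what was persisted."""

    def __init__(self):
        self.on_disk = {}  # name -> inode number, as written to disk
        self.dcache = {}   # name -> inode number, cached in memory

    def create(self, name, ino, persisted=True):
        # persisted=False models the suspected bug: the entry is cached
        # but never makes it into the on-disk directory.
        self.dcache[name] = ino
        if persisted:
            self.on_disk[name] = ino

    def lookup(self, name):
        """Resolve a path component; None stands in for ENOENT."""
        if name in self.dcache:
            return self.dcache[name]
        return self.on_disk.get(name)

    def readdir(self):
        """Directory listings are always generated from on-disk state."""
        return sorted(self.on_disk)

    def drop_caches(self):
        """Rough analogue of "echo 3 > /proc/sys/vm/drop_caches"."""
        self.dcache.clear()


fs = ToyFs()
fs.create("a", 1)
fs.create("foo", 2, persisted=False)
print(fs.lookup("foo"), fs.readdir())  # 2 ['a']
fs.drop_caches()
print(fs.lookup("foo"))                # None
```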
>>>>
>>>>> I am not sure if the problem only happens with big directories. The
>>>>> machine this is happening on is a file server, and there seem to be more
>>>>> "file not found" errors than usual in the logs, but I can't tell if that
>>>>> is caused by the same issue, or if the clients were simply trying to
>>>>> open files that were never uploaded in the first place.
>>>>
>>>> Could you post an example of the "file not found" errors?  What generates
>>>> those messages?  From your other email, it looks like the directories are
>>>> about 7000 files each?  I wouldn't say that is too large, but I think this
>>>> is less relevant in light of the above comments.
>>>>
>>>>>> You could use "zdb" to dump the
>>>>>> directory to confirm the entry is there
>>>>> How do I do this?
>>>>
>>>> Sorry, I don't know much about zdb, just that it _can_ do this kind of
>>>> debugging...  Maybe someone with zdb-fu could help?  That said, I'm not
>>>> sure this information is useful anymore.  It seems like the root problem
>>>> is that the files are not even written to disk, not that there is a problem
>>>> returning them from the directory.
>>>>
>>>> My next suggestion (if you can reproduce this reliably with some test script)
>>>> is to use git bisect to build different zfs.ko modules and isolate the
>>>> problem to a specific patch, so it can be reverted or fixed.
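Below is a minimal Python sketch of such a test script, modeled on the "cp -r src dst" case from the original report. Some of its details are assumptions:

- The helper names are hypothetical.
- The script expects to be given a scratch directory on the affected dataset.
- The 4 KiB file size is a guess. The report used much larger files, so the bug may need more data to trigger.

"git bisect run" treats exit status 0 as good and 1 as bad. Building and loading each candidate zfs.ko between steps is not shown.

```python
import os
import shutil
import sys
import tempfile


def make_tree(root, subdirs, files_per_dir, size=4096):
    """Populate root with subdirs directories of files_per_dir files each."""
    for d in range(subdirs):
        sub = os.path.join(root, f"d{d:02d}")
        os.makedirs(sub)
        for f in range(files_per_dir):
            with open(os.path.join(sub, f"f{f:05d}"), "wb") as fh:
                fh.write(os.urandom(size))


def unlisted_after_copy(src, dst):
    """Copy src to dst, then return paths (relative to src) that are
    listed in src but absent from the corresponding dst listing."""
    # An ENOSPC during the copy raises shutil.Error; left uncaught, the
    # script exits with status 1, which bisect also counts as bad.
    shutil.copytree(src, dst)
    unlisted = []
    for dirpath, dirnames, filenames in os.walk(src):
        rel = os.path.relpath(dirpath, src)
        listed = set(os.listdir(os.path.join(dst, rel)))
        unlisted += [os.path.join(rel, n)
                     for n in dirnames + filenames if n not in listed]
    return unlisted


def main(scratch):
    """Return a "git bisect run" status: 0 = good, 1 = bad."""
    work = tempfile.mkdtemp(dir=scratch)
    try:
        src = os.path.join(work, "src")
        os.makedirs(src)
        # Tree shape from the report: 12 sub-directories, 6999 files each.
        make_tree(src, subdirs=12, files_per_dir=6999)
        unlisted = unlisted_after_copy(src, os.path.join(work, "dst"))
        print(f"{len(unlisted)} entries missing from dst listings")
        return 1 if unlisted else 0
    finally:
        shutil.rmtree(work, ignore_errors=True)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 repro.py /path/to/scratch/dir/on/the/dataset
    sys.exit(main(sys.argv[1]))
```

For example, you could pass "git bisect run" a wrapper that rebuilds and reloads the module and then runs "python3 repro.py /tank/scratch". Both "repro.py" and "/tank/scratch" are placeholders.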
>>>>
>>>> Definitely it is also time to file an issue at Github, and make it clear
>>>> this is a data loss/corruption issue.
>>>>
>>>> Cheers, Andreas
>>>>
>>>>>
>>>>>
>>>>> On 04/05/2018 04:58 PM, Andreas Dilger wrote:
>>>>>> On Apr 5, 2018, at 2:34 PM, Vladimir Brik <vladimir.brik at icecube.wisc.edu> wrote:
>>>>>>>
>>>>>>> Hello.
>>>>>>>
>>>>>>> I have run into a strange issue where files don't show up in directory
>>>>>>> listing but can be accessed by path directly. I wonder if somebody knows
>>>>>>> what might have caused this.
>>>>>>>
>>>>>>> # find dst/a/foo
>>>>>>> dst/a/foo
>>>>>>> (as expected)
>>>>>>>
>>>>>>> # find dst/a/ -name foo
>>>>>>> (no output)
>>>>>>>
>>>>>>> # ls -l dst/a/foo
>>>>>>> -rw-r--r-- 1 xxx xxx 5991051 Feb 22 13:35 dst/a/foo
>>>>>>> (as expected)
>>>>>>>
>>>>>>> # ls -l dst/a/ | grep foo
>>>>>>> (no output)
>>>>>>>
>>>>>>> # cp dst/a/foo bar
>>>>>>> (this works; bar is created and can be listed)
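The manual checks above can be scripted. In this minimal Python sketch, "classify" is a hypothetical helper. It assumes the expected names come from a trustworthy source, such as the source directory or "tar --list". It sorts names into three groups:

- names that readdir returns
- names that are reachable only by direct lookup
- names that are gone entirely, like the files that hit "No space left on device"

```python
import os


def classify(directory, expected_names):
    """Split expected_names into (listed, unlistable, missing).

    listed:     returned by readdir (os.listdir)
    unlistable: absent from the listing, yet a direct lookup (lstat) works
    missing:    neither listed nor reachable by path
    """
    listing = set(os.listdir(directory))
    listed, unlistable, missing = [], [], []
    for name in expected_names:
        if name in listing:
            listed.append(name)
            continue
        try:
            os.lstat(os.path.join(directory, name))
            unlistable.append(name)
        except FileNotFoundError:
            missing.append(name)
    return listed, unlistable, missing
```

For example: classify("dst/a", os.listdir("src/a")).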
>>>>>>
>>>>>> There are a few potential issues that might cause this.  One is if
>>>>>> getdents() returns from the kernel with d_ino == 0, then "ls" and
>>>>>> other directory walking tools will skip the entry as "deleted" for
>>>>>> historical reasons.
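To make the d_ino == 0 convention concrete, here is a small Python sketch. It decodes a getdents64-style buffer laid out as struct linux_dirent64: 64-bit d_ino and d_off, a 16-bit d_reclen, an 8-bit d_type, then a NUL-terminated name, with each record padded to 8 bytes. It then drops zero-inode records the way a readdir()-based walker would. The sketch assumes a little-endian layout. The buffer comes from a hypothetical "pack_dirent" helper rather than a real system call:

```python
import struct

# Fixed part of struct linux_dirent64 (little-endian assumed):
# d_ino (u64), d_off (s64), d_reclen (u16), d_type (u8); d_name follows.
_HDR = struct.Struct("<QqHB")
DT_REG = 8


def pack_dirent(ino, off, name, dtype=DT_REG):
    """Build one synthetic record; hypothetical helper for the demo."""
    raw = name.encode() + b"\0"
    reclen = (_HDR.size + len(raw) + 7) & ~7  # records are 8-byte aligned
    return (_HDR.pack(ino, off, reclen, dtype) + raw).ljust(reclen, b"\0")


def parse_dirents(buf):
    """Yield (d_ino, name) for every record in a getdents64-style buffer."""
    pos = 0
    while pos < len(buf):
        ino, _off, reclen, _dtype = _HDR.unpack_from(buf, pos)
        if reclen == 0:
            raise ValueError("corrupt record: d_reclen == 0")
        name = buf[pos + _HDR.size:pos + reclen].split(b"\0", 1)[0]
        yield ino, name.decode()
        pos += reclen


def visible_names(buf):
    """What a readdir()-style walker shows: d_ino == 0 is treated as a
    deleted slot and silently skipped."""
    return [name for ino, name in parse_dirents(buf) if ino != 0]


buf = pack_dirent(101, 1, "a") + pack_dirent(0, 2, "foo") + pack_dirent(103, 3, "b")
print([n for _, n in parse_dirents(buf)])  # ['a', 'foo', 'b']
print(visible_names(buf))                  # ['a', 'b']
```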
>>>>>>
>>>>>> It might also be that "ls" and ZFS directory iteration do not play well
>>>>>> together, skipping some entries in the directory (e.g. hash collisions,
>>>>>> or if telldir() and seekdir() do not work properly).  If your problem
>>>>>> only happens on large directories then this is a possibility.
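Here is a toy Python model of the hash-collision case. It is not ZFS's actual ZAP cursor code. It assumes a directory that is read back in hash order across several batches, with a resume cookie passed between calls (as telldir()/seekdir() and successive getdents() calls rely on). If the cookie records only the hash of the last entry returned, the next batch skips any remaining entries with that same hash. The 4-bit "toy_hash" is hypothetical and chosen so that collisions are certain:

```python
import zlib


def toy_hash(name, bits=4):
    """Deliberately tiny hash (hypothetical) so that collisions are certain."""
    return zlib.crc32(name.encode()) & ((1 << bits) - 1)


def read_all(names, batch_size, exact_cookie):
    """Read a hash-ordered directory in batches, resuming from a cookie.

    exact_cookie=True:  the cookie is (hash, name), so the next batch
                        starts exactly after the last entry returned.
    exact_cookie=False: the cookie is only the hash, so the next batch
                        starts at the next hash value, skipping any
                        not-yet-returned entries that share the last hash.
    """
    order = sorted(names, key=lambda n: (toy_hash(n), n))
    returned, cookie = [], None
    while True:
        if cookie is None:
            rest = order
        elif exact_cookie:
            rest = [n for n in order if (toy_hash(n), n) > cookie]
        else:
            rest = [n for n in order if toy_hash(n) > cookie]
        batch = rest[:batch_size]
        if not batch:
            return returned
        returned.extend(batch)
        last = batch[-1]
        cookie = (toy_hash(last), last) if exact_cookie else toy_hash(last)


names = [f"file{i}" for i in range(50)]
print(len(read_all(names, 1, exact_cookie=True)))   # 50
print(len(read_all(names, 1, exact_cookie=False)))  # at most 16: one per hash value
```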
>>>>>>
>>>>>> Run your "ls -l dst/a/" under strace and/or ltrace to see if these
>>>>>> entries are being returned from the kernel, but not printed by "ls",
>>>>>> or if they are not being returned by the kernel at all.  Something like:
>>>>>>
>>>>>>   strace -f -e trace=open,getdents,lstat -v -y ls -l dst/a/
>>>>>>
>>>>>> The exact system calls for getdents() and lstat() may depend on your
>>>>>> kernel and userspace libraries.  Note that this will suppress all of
>>>>>> the other system calls, but makes the output more readable.
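Here is a rough Python helper for reading that output. It assumes the {d_ino=..., d_name="...", ...} form that "strace -v" prints for getdents/getdents64 records. The exact layout varies between strace versions, so the regular expression may need adjusting. "dirents_from_strace" is a hypothetical name:

```python
import re

# One dirent as printed by "strace -v", e.g.
#   {d_ino=1234, d_off=..., d_reclen=24, d_name="foo", d_type=DT_REG}
# The exact layout varies between strace versions; adjust if needed.
_DIRENT = re.compile(r'd_ino=(\d+)[^{}]*?d_name="((?:[^"\\]|\\.)*)"')


def dirents_from_strace(text):
    """Map each name the kernel returned via getdents to its d_ino.
    Names that strace printed with escapes are returned still escaped."""
    return {name: int(ino) for ino, name in _DIRENT.findall(text)}
```

Assuming the trace was saved with strace's "-o trace.txt" option, seen = dirents_from_strace(open("trace.txt").read()) produces a map of name to d_ino. A name absent from "seen" was never returned by the kernel. A name with seen[name] == 0 was returned but would be dropped by ls.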
>>>>>>
>>>>>> Another possibility is a bug in the ZFS ZAP processing code, which does
>>>>>> not iterate over the entries properly, and doesn't return the names to
>>>>>> userspace via getdents() at all.  You could use "zdb" to dump the
>>>>>> directory to confirm the entry is there (it pretty much *HAS* to be, if
>>>>>> the "dst/a/foo" lookup works).  At that point, running with tracepoints,
>>>>>> or adding printk() debug messages and rebuilding the zfs.ko module would
>>>>>> help debug where the problem is.
>>>>>>
>>>>>> Cheers, Andreas
>>>>>>
>>>>>>> The problem occurs when I run something like "cp -r src dst", where src
>>>>>>> is a directory with 12 sub-directories with 6999 files each, about 84K
>>>>>>> files total, 2.9TB. After copy finishes, dst is missing several thousand
>>>>>>> files according to find. (A similar thing happened when I tarred src and
>>>>>>> then unpacked it in a different location; according to tar --list the
>>>>>>> tarball contained all files.)
>>>>>>>
>>>>>>> The cp command reported "No space left on device" for a couple of files.
>>>>>>> The filesystem has about 80TB free (zpool is about 50% full). The files
>>>>>>> for which the "No space left on device" error was generated just weren't
>>>>>>> created, it seems, while the other missing files are accessible by their
>>>>>>> full path but do not show up in directory listings (as shown above).
>>>>>>>
>>>>>>> ls is reporting some sub-directories of dst have 7000 hard links instead
>>>>>>> of 7001 that the sub-directories in src have. All missing files seem to
>>>>>>> be from such sub-directories.
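Here is a short Python sketch for finding the affected sub-directories by comparing link counts, as done above with ls. "nlink_mismatches" is a hypothetical helper. It assumes both trees are on the same kind of filesystem, since what st_nlink means for a directory differs between filesystems:

```python
import os


def nlink_mismatches(src, dst):
    """Return {name: (src_nlink, dst_nlink)} for each top-level
    sub-directory of src whose link count differs in dst."""
    diffs = {}
    for entry in os.scandir(src):
        if entry.is_dir(follow_symlinks=False):
            a = entry.stat(follow_symlinks=False).st_nlink
            b = os.lstat(os.path.join(dst, entry.name)).st_nlink
            if a != b:
                diffs[entry.name] = (a, b)
    return diffs
```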
>>>>>>>
>>>>>>> After rebooting the server, the missing files were no longer accessible
>>>>>>> by full path.
>>>>>>>
>>>>>>> It seems the problem is reproducible. Missing files are not always the same.
>>>>>>>
>>>>>>> I am running ZFS 0.7.7, Scientific Linux release 6.8. No ZFS snapshots.
>>>>>>>
>>>>>>> If anybody can shed light on this, I would really appreciate it :)
>>>>>>>
>>>>>>>
>>>>>>> Vlad
>>>>>>> _______________________________________________
>>>>>>> zfs-discuss mailing list
>>>>>>> zfs-discuss at list.zfsonlinux.org
>>>>>>> http://list.zfsonlinux.org/cgi-bin/mailman/listinfo/zfs-discuss
>>>>
>>

