Migration from the KQ Implementation
gordan.bobic at gmail.com
Tue May 3 16:44:18 EDT 2011
Most interesting, thanks for that. But why would it only affect things
when dedupe+compression is used? I run scrubs daily, and it doesn't seem
to happen when all the pools are without dedupe/compression.
On 05/03/2011 08:19 PM, Gunnar Beutner wrote:
> Hi Gordan,
> the problem you're describing sounds quite similar to
> https://github.com/behlendorf/zfs/issues/174 - the root cause for this
> is a stack overrun by code that's used for "zpool scrub".
> Once you've hit that problem it's virtually impossible to import the
> pool (unless you have a recent enough git checkout / Debian package
> where this issue is fixed) because ZFS will try to resume the "zpool
> scrub" as soon as it has imported the pool - and almost instantly crash.
> A quick-and-dirty workaround I've used while debugging this problem is
> adding a 'return' statement right at the beginning of the
> dsl_scan_visitbp function. Of course this will effectively break/disable
> "zpool scrub" - but I really didn't want to re-create a 600GB pool at the time.
> As for kernel crashdumps: I've found kdump to be invaluable in debugging
> kernel panics like this. You might want to look into using that.
> Gunnar Beutner
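As a rough illustration of the kdump suggestion above, here is a minimal sketch that checks whether a machine looks ready to capture a crash dump. It assumes a Linux kernel that exposes /proc/cmdline and /sys/kernel/kexec_crash_loaded; the function names has_crashkernel and kdump_status are made up for this example, and the distribution-specific steps for actually enabling kdump are not shown.

```shell
#!/bin/sh
# Succeed if the given kernel command line reserves memory for a crash kernel.
# (has_crashkernel is a hypothetical helper name, not part of kdump itself.)
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report whether the running system looks ready to capture a panic.
kdump_status() {
    if has_crashkernel "$(cat /proc/cmdline 2>/dev/null)"; then
        echo "crashkernel= is set on the kernel command line"
    else
        echo "crashkernel= is missing from the kernel command line"
    fi
    # This file reads 1 once a capture kernel has been loaded via kexec.
    if [ "$(cat /sys/kernel/kexec_crash_loaded 2>/dev/null)" = "1" ]; then
        echo "capture kernel is loaded"
    else
        echo "capture kernel is not loaded"
    fi
}

kdump_status
```

If both checks pass, a panic during pool import should leave a vmcore behind that can be examined after the reboot, rather than a stack trace scrolling off the console.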
> On 03.05.2011 20:54, Gordan Bobic wrote:
>> On 05/03/2011 05:51 PM, Brian Behlendorf wrote:
>>> Hi Gordan,
>>> I would suggest you try the latest 0.6.0-rc source from Github or
>>> Darik's PPA. This will become the 0.6.0-rc4 tag shortly if it passes
>>> the needed testing. I suspect you'll be pleasantly surprised with the
>>> stability of this implementation.
>>> git clone git://github.com/behlendorf/spl.git
>>> git clone git://github.com/behlendorf/zfs.git
>>> As for when an official 0.6.0 tag will be released it primarily depends
>>> on resolving the remaining 14 open bugs. We don't want to tag a stable
>>> release with any known stability/correctness issues.
>> Indeed, I understand that.
>>> As for your dedup+compression issue I would suggest trying to reproduce
>>> it with this code base.
>> The problem is that the issue is quite nebulous. Once it occurs, the
>> kernel will crash hard, and the stack trace goes off screen. The only
>> thing that I know for sure is that txg_sync is listed in the trace, and
>> that it will happen as soon as zfs module is loaded (if zpool.cache
>> exists), or as soon as the zpool is imported (if there is no zpool.cache).
>> In retrospect I should have probably made sure that the zpool is created
>> as v23 so that I could use the fuse implementation to scrub the pool out
>> and still have the data accessible, but I hadn't thought of that in time
>> - and I really want to avoid having to do another 4 TB restore (you
>> never quite realize just how much data 4TB is until you actually have to
>> copy it all).
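For reference, a minimal sketch of pinning a new pool to on-disk version 23, as described above. It only prints the command rather than running it. The pool name "tank", the device names, and the helper name v23_create_cmd are placeholders for this example; the "version" pool property passed with -o at creation time is the only part taken from zpool itself.

```shell
#!/bin/sh
# Print (do not run) the command that would create a pool at version 23.
# Usage: v23_create_cmd POOL VDEV...
# (v23_create_cmd is a hypothetical helper name used only for this sketch.)
v23_create_cmd() {
    pool=$1
    shift
    echo "zpool create -o version=23 $pool $*"
}

# "tank" and the device paths below are placeholders.
v23_create_cmd tank mirror /dev/sdb /dev/sdc
```

After creating a pool this way, "zpool get version tank" should report 23. The pool has to stay at that version (no "zpool upgrade") for an implementation limited to v23 to remain able to import it.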
>>> As you probably know the KQ implementation was
>>> originally derived from this project. However, we have fixed numerous
>>> issues which never made it back in to the KQ version. It wouldn't
>>> surprise me too much if your issue has already been addressed.
>> I tend not to put much faith in issues being fixed by accident. OTOH, I'd
>> love to be pointed at a closed bug report that sounds like it might have
>> been responsible for my crash. I'm not sure if the issue is caused by
>> merely having file systems with dedup+compress set to on, or whether it is
>> caused by twiddling those two flags on a fs while a large file is being
>> copied to it. But it happened to me twice in 4 days in exactly the same
>> way, just when I got the fs restored back to how it was.
>> So - the issue should be re-creatable, but unfortunately, I'm not
>> prepared to go through the re-creation effort right now because it takes
>> me about 2 days to recover from it and I can't do without the hardware
>> I am running it on for the next few days.
>>> If you
>>> are able to recreate it please open a bug on the issue tracker with the
>>> crash details. Just think of it as one more blocker for the 0.6.0 release.
>> That's another problem - extracting crash details may prove challenging
>> since the machine locks up very hard, and the stack trace is larger than the screen
>> buffer, so the best I could probably do easily is a photo of the part of
>> the kernel stack trace that fits on the screen.
>> Thanks for your input, I appreciate it. If for some reason another
>> restore becomes inevitable I'll try to re-create the problem with zpool
>> v23 so that I can fall back on the fuse implementation if needs be, and
>> try rc4 when it comes out.