Migration from the KQ Implementation

Brian Behlendorf behlendorf1 at llnl.gov
Tue May 3 16:58:47 EDT 2011


I'm in agreement with Gunnar: your compress+dedup crash sounds exactly
like the stack overrun described in issue 174, which has been fixed.

  https://github.com/behlendorf/zfs/issues/174

Additionally, even if this does turn out to be a different issue which
still exists, I'm sure we can suggest a way to recover the pool without
resorting to recreating it from a backup.
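
For example, depending on where it is actually failing, a read-only or
rewind import will often get the data back without a full restore.  The
exact options depend on the code you end up running, so treat this as a
sketch (pool name illustrative):

  # keep the module from auto-importing the damaged pool at load time
  mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bak
  modprobe zfs

  # then try importing read-only, or rewinding to an earlier txg
  zpool import -o readonly=on tank    # if your build supports it
  zpool import -Fn tank               # dry run of the rewind
  zpool import -F tank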

-- 
Thanks,
Brian 

On Tue, 2011-05-03 at 11:54 -0700, Gordan Bobic wrote:
> On 05/03/2011 05:51 PM, Brian Behlendorf wrote:
> > Hi Gordan,
> >
> > I would suggest you try the latest 0.6.0-rc source from Github or
> > Darik's PPA.  This will become the 0.6.0-rc4 tag shortly if it passes
> > the needed testing.  I suspect you'll be pleasantly surprised with the
> > stability of this implementation.
> >
> > github:
> >    git clone git://github.com/behlendorf/spl.git
> >    git clone git://github.com/behlendorf/zfs.git
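> >
> > Building from a checkout is roughly the usual autotools dance; build
> > and install the SPL before the ZFS tree (adjust for your distro):
> >
> >    cd spl && ./autogen.sh && ./configure && make && sudo make install
> >    cd ../zfs && ./autogen.sh && ./configure && make && sudo make install
> >    # zfs's configure may need --with-spl=/path/to/spl if the SPL is
> >    # not picked up automatically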
> >
> > PPA:
> >    https://launchpad.net/~dajhorn/+archive/zfs
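> >
> > On Ubuntu the PPA route is roughly (the metapackage name may differ):
> >
> >    sudo add-apt-repository ppa:dajhorn/zfs
> >    sudo apt-get update
> >    sudo apt-get install ubuntu-zfs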
> >
> > As for when an official 0.6.0 tag will be released, it primarily depends
> > on resolving the remaining 14 open bugs.  We don't want to tag a stable
> > release with any known stability/correctness issues.
> >
> >    http://github.com/behlendorf/zfs/issues
> 
> Indeed, I understand that.
> 
> > As for your dedup+compression issue I would suggest trying to reproduce
> > it with this code base.
> 
> The problem is that the issue is quite nebulous. Once it occurs, the 
> kernel will crash hard, and the stack trace goes off screen. The only 
> thing that I know for sure is that txg_sync is listed in the trace, and 
> that it will happen as soon as the zfs module is loaded (if zpool.cache 
> exists), or as soon as the zpool is imported (if there is no zpool.cache).
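> 
> In other words, either of these is enough to trigger it (pool name 
> illustrative):
> 
>    modprobe zfs        # dies immediately if /etc/zfs/zpool.cache exists
>    zpool import tank   # dies here if there is no cache file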
> 
> In retrospect I should probably have made sure that the zpool was 
> created as v23, so that I could use the fuse implementation to scrub 
> the pool and still have the data accessible, but I hadn't thought of 
> that in time - and I really want to avoid having to do another 4 TB 
> restore (you never quite realize just how much data 4 TB is until you 
> actually have to copy it all).
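> 
> If I do end up recreating it, something like this should pin it to a 
> version zfs-fuse can still import (devices illustrative):
> 
>    zpool create -o version=23 tank raidz /dev/sdb /dev/sdc /dev/sdd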
> 
> > As you probably know the KQ implementation was
> > originally derived from this project.  However, we have fixed numerous
> > issues which never made it back into the KQ version.  It wouldn't
> > surprise me too much if your issue has already been addressed.
> 
> I tend not to put much faith in issues being fixed by accident. OTOH, I'd 
> love to be pointed at a closed bug report that sounds like it might have 
> been responsible for my crash. I'm not sure if the issue is caused by 
> merely having file systems with dedup+compress set to on, or whether it 
> is caused by twiddling those two flags on a fs while a large file is 
> being copied to it. But it happened to me twice in 4 days in exactly the 
> same way, just when I got the fs restored back to how it was.
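> 
> The flag-twiddling scenario would be roughly (names illustrative):
> 
>    cp /backup/big.img /tank/data/ &    # large copy still in flight
>    zfs set dedup=on tank/data          # flags twiddled mid-copy
>    zfs set compression=on tank/data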
> 
> So - the issue should be re-creatable, but unfortunately, I'm not 
> prepared to go through the re-creation effort right now because it takes 
> me about 2 days to recover from it, and I can't really do without the 
> hardware I'm running it on for the next few days.
> 
> > If you
> > are able to recreate it please open a bug on the issue tracker with the
> > crash details.  Just think of it as one more blocker for the 0.6.0
> > release!
> 
> That's another problem - extracting crash details may prove challenging 
> since the machine locks up very hard and the stack trace is longer than 
> the screen buffer, so the best I could probably do easily is a photo of 
> the part of the kernel stack trace that fits on the screen.
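> 
> Capturing the whole trace would probably mean wiring up a serial 
> console or netconsole first, something along these lines (addresses 
> illustrative):
> 
>    # kernel command line, for a serial console
>    console=ttyS0,115200
>    # or log the oops to another machine over the network
>    modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.20/00:11:22:33:44:55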
> 
> Thanks for your input, I appreciate it. If for some reason another 
> restore becomes inevitable I'll try to re-create the problem with zpool 
> v23 so that I can fall back on the fuse implementation if needs be, and 
> try rc4 when it comes out.
> 
> Gordan


