[zfs-discuss] Something wrong with my zfs pool (still), can't figure out where the problem is or what it is.

Christ Schlacta aarcane at aarcane.org
Fri Oct 7 18:45:11 EDT 2011


I tried booting and importing in Oracle Solaris 11 Express 2010.11, but 
it hung for over 5 minutes, and both zpool status and zpool iostat froze 
when I tried to run them once after the import (zpool import -N tank) had 
started.
I'll try illumos tomorrow (today is very busy), and I don't think I have 
access to MacZFS right now at all.
So I spent a little time with it. After any single action, zfs starts to 
lock up.  The trace messages mention a mutex in the last lines almost 
exclusively, so I think there may be some contention or race issues.  
Then again, something may just be holding the mutex and doing its thing 
forever.
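
If it would help, next time a command wedges I can try to grab kernel 
stacks for it (assuming this kernel exposes /proc/<pid>/stack and has 
magic sysrq enabled; the pgrep line is just an example for zpool):

  # kernel stack of the oldest stuck zpool process
  cat /proc/$(pgrep -xo zpool)/stack
  # ask the kernel to dump all blocked (D-state) task stacks into dmesg
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 60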

On 10/7/2011 05:18, Derek Day wrote:
>
> How is illumos at this point? Is booting up a live CD an option yet?
>
> The safest thing is probably to try importing on a different build and 
> see if that build can find the problem... The send/recv is still a 
> decent option, but if there is any bug or error output that is being 
> lost, it might just keep retrying until your disk fails.
>
> You might also test for power issues. In MacZFS, I have noticed 
> that most inexplicable behavior is power related and is caught by ZFS, 
> instead of the errors being silently ignored until they become catastrophic.
>
> I need to get back to having a snapshot and send task that keeps a 
> physically separated copy of my pools. Once disks start failing it 
> gets a lot harder.
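>
> Something like this is what I mean, just as a sketch (the snapshot name 
> and the backup pool name here are made up):
>
> # recursive snapshot, then replicate it to a pool on separate hardware
> zfs snapshot -r tank@backup-20111007
> zfs send -R tank@backup-20111007 | zfs recv -d backuppool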
>
> Derek
>
> > On Oct 6, 2011 9:57 PM, "Christ Schlacta" <aarcane at aarcane.org> wrote:
> >
> > root at density:~# zpool status -v tank
> >   pool: tank
> >  state: ONLINE
> >  scan: scrub canceled on Thu Oct  6 15:10:05 2011
> > config:
> >
> >         NAME                        STATE     READ WRITE CKSUM
> >         tank                        ONLINE       0     0     0
> >           raidz1-0                  ONLINE       0     0     0
> >             wwn-0x5000cca35dc3c27c  ONLINE       0     0     0
> >             wwn-0x5000cca35dc4d6e5  ONLINE       0     0     0
> >             wwn-0x5000cca35dc1c2bd  ONLINE       0     0     0
> >             wwn-0x5000cca35dc5f81d  ONLINE       0     0     0
> >             wwn-0x5000cca35dc6295c  ONLINE       0     0     0
> >             wwn-0x5000cca35dc4abb5  ONLINE       0     0     0
> >
> > errors: No known data errors
> >
> > Sadly, absolutely nothing is listed there, and I do have the disks 
> > imported by their by-id names (cd /dev/disk/by-id/; rm ata*; rm scsi*; 
> > zpool import -d . tank; reboot;)
> >
> >
> > On 10/6/2011 17:29, Derek Day wrote:
> >>
> >> What does zpool status -v tell you?
> >>
> >> I had big problems similar to this where I/O operations were 
> >> suspended. The issue was that my /dev/sdX type devices kept trading 
> >> places and ZFS couldn't figure out the correct pool configuration. I 
> >> manipulated things enough to get back to the original device ordering 
> >> and then exported the pool and reimported using -d /dev/disk/by-id -a.
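> >>
> >> In case it helps, the fix was roughly these two commands (from memory; 
> >> the pool name in the export is whatever yours is):
> >>
> >> zpool export tank
> >> zpool import -d /dev/disk/by-id -a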
> >>
> >> Maybe you are seeing a similar issue?
> >>
> >> Derek
> >>
> >> On Oct 6, 2011 7:27 PM, "Christ Schlacta" <aarcane at aarcane.org> wrote:
> >> >
> >> > So I've been having problems with this pool for a little while, 
> >> > and I've not been able to track it down, and I have no clue how to run 
> >> > a debug trace or any of that fancy stuff...  but my zpool and zfs 
> >> > commands keep...  hanging forever (or as long as I'm willing to wait, 
> >> > which is long enough for the kernel watchdog to tell me it's been 120 
> >> > seconds several times).  They do this on various commands, and during 
> >> > various situations.  My scrubs eventually hang and fall into a loop 
> >> > where they read 15k every few seconds and nothing more, even though the 
> >> > disks are idle.  My mount command WAS failing on zfs mount 
> >> > tank/backups/share/, but I fixed that by taking a snapshot, cloning it 
> >> > and promoting the clone, mounting that, snapshotting it again, and 
> >> > deleting the original filesystem and snapshot and renaming back.  I 
> >> > pruned some data from the filesystem that I thought might be the 
> >> > problem (snapshots before AND after, followed by destroying the 
> >> > snapshots), and strangely, my filesystem never recovered the space I 
> >> > thought it should have.  After this surgery, another filesystem 
> >> > suddenly exhibited the strange behavior of not mounting.
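> >> >
> >> > For reference, the clone/promote dance above went roughly like the 
> >> > following; I'm reconstructing it from memory, and the snapshot and 
> >> > clone names here are just examples:
> >> >
> >> > zfs snapshot tank/backups/share@rescue
> >> > zfs clone tank/backups/share@rescue tank/backups/share.new
> >> > zfs promote tank/backups/share.new
> >> > zfs destroy -r tank/backups/share
> >> > zfs rename tank/backups/share.new tank/backups/share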
> >> >
> >> > When these filesystems don't mount, the best I can hope for is to 
> >> > reboot, because they won't respond to ^C, ^Z, kill, kill -9, or kill 
> >> > -11.  It's not even a hidden process or anything; the terminal 
> >> > itself is stuck on the mount task.  I have another pool as well. 
> >> > It's working perfectly, aside from sometimes hanging when the zpool 
> >> > internals start hanging.
> >> >
> >> > Sync calls block as well.
> >> >
> >> > I most recently tried to run a zpool export and zpool import on 
> >> > the pools.  The good pool worked perfectly, but tank needed to be 
> >> > import -f'd, which is pending currently.  It's been going for oh... 
> >> > 5-7 minutes now, and still ticking away.  The disk activity lights are 
> >> > actually lighting up, so it's doing SOMETHING, I just don't know what, 
> >> > or why.  These are some typical iostat (not zpool iostat) numbers for 
> >> > the system while it runs:
> >> >
> >> > sdc             141.00        96.50         0.00         96          0
> >> > sdd             131.00       108.00         0.00        108          0
> >> > sdf             136.00        84.50         0.00         84          0
> >> > sdg             136.00        70.50         0.00         70          0
> >> > sdh             133.00        67.50         0.00         67          0
> >> > sdi             149.00        78.50         0.00         78          0
> >> >
> >> > Nothing interesting has shown up in dmesg aside from hung process 
> >> > notifications.
> >> >
> >> > I'm at the point where I'm almost ready to start harvesting 
> >> > drives to create a new zpool that I can send the working filesystems 
> >> > to (zfs send) and hope for the best, but I know that's a terrible idea 
> >> > for numerous reasons.
> >> >
> >> > I'm looking for any advice, help, input, feedback, or anything 
> >> > you have to help me fix this pool.
> >
> >
>


