[zfs-discuss] null pointer dereference ddt_phys_decref+0x9/0x10

Matthew Robbetts wingfeathera at gmail.com
Thu Mar 22 10:31:28 EDT 2012


Excuse me, that should be zfs_arc_max=0x20000000.
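
(For reference, a minimal sketch of where those two knobs usually live on a
Linux/ZoL box; the file paths below are just the conventional ones and the
ARC cap has to be given in bytes:)

    # /etc/modprobe.d/zfs.conf -- cap the ARC at 512MB (0x20000000 bytes);
    # takes effect when the zfs module is loaded
    options zfs zfs_arc_max=536870912

    # /etc/sysctl.conf -- keep 256MB in the kernel's free-page reserve
    vm.min_free_kbytes = 262144

    # apply the sysctl value immediately, without a reboot
    sysctl -w vm.min_free_kbytes=262144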


On 22 Mar 2012, at 14:25, Matthew Robbetts wrote:

> Compression, you say? That's been a bit of a sticking point recently. In my own experience, using compression can lead to lockups after several hours of rsync. Just before the lockup, top showed that arc_reclaim and kswapd0 were both contending for 100% CPU time. Pretty much no actual I/O was happening at that point; the system was just spinning and locked up slightly later. I believe this behaviour is described in Issue 149.
> 
> What is the value of your /proc/sys/vm/min_free_kbytes? I've had to increase it drastically from its default (of the order of 5k on my system under Gentoo) to get stability with ZoL. I experience the above lockup with it set to 128MB. With 256MB I've so far not experienced it (but haven't properly stressed the system yet).
> 
> Also, on my 2GB system I set zfs_arc_max to 512MB (0x40000000). Along with the 256MB min_free_kbytes this seems to work quite nicely. Perhaps give those numbers a try.
> 
> 
> 
> On 22 Mar 2012, at 13:19, ScOut3R wrote:
> 
>> arc_reclaim locks up, but the first stack trace comes from z_int_rw
>> (or something like that) with a kernel oops. This happens when I'm
>> rsyncing data from an mdadm/ext4 array to the ZFS pool. I'm using the
>> daily repo now. What I noticed is that a few minutes before the oops
>> happens, the rsync speed drops significantly. The system is running
>> with 3GB of RAM and I've set the ARC's maximum memory to 2GB. I don't
>> use dedup, but I have compression enabled on half of the volumes. Do
>> you need any other info that would be helpful?
>> 
>> On Mar 22, 12:15 pm, ScOut3R <mailingl... at modernbiztonsag.org> wrote:
>>> Thank you, I just bumped into the open issue.
>>> 
>>> After I was able to create a new pool from the same drives, I began to
>>> copy over the backed-up data, and the copy process crashed. The kernel
>>> was spilling out process hang warnings. I did not manage to catch the
>>> actual process name, but I'm running memtest on the system now. I
>>> wouldn't be surprised if the initial pool corruption was caused by
>>> faulty hardware.
>>> 
>>> On Mar 19, 5:49 pm, Brian Behlendorf <behlendo... at llnl.gov> wrote:
>>> 
>>>> This looks like an upstream bug which is very well described in the
>>>> following post to zfs-discuss.  If you were to make the same change I
>>>> suspect you would be able to import the pool.  I'd then suggest you
>>>> migrate your data off to a new pool.
>>> 
>>>> http://mail.opensolaris.org/pipermail/zfs-discuss/2012-February/05097...
>>> 
>>>> How your pool could have been damaged like this is the real mystery, but
>>>> it appears you're not the first person to see this.  I'll open an issue so
>>>> we can at least update the code with some better error handling.
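>>>> 
>>>> (Purely as an illustration of that migration step -- "tank" and
>>>> "newpool" below are placeholder pool names -- a recursive snapshot
>>>> plus send/receive is the usual way to copy everything across:)
>>>> 
>>>>     zfs snapshot -r tank@migrate
>>>>     zfs send -R tank@migrate | zfs receive -F -d newpool
>>>> 
>>>> (Once the copy has been verified, the old pool can be retired with
>>>> zpool destroy tank.)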
>>> 
>>>> --
>>>> Thanks,
>>>> Brian
>>> 
>>>> On Mon, 2012-03-19 at 03:12 -0700, ScOut3R wrote:
>>>>> Dear List,
>>> 
>>>>> I'm running Ubuntu Lucid 64-bit with the latest stable PPA packages
>>>>> (0.6.0.54). On import I get the following kernel output and the
>>>>> import fails:
>>> 
>>>>> [ 1993.534996] BUG: unable to handle kernel NULL pointer dereference
>>>>> at 0000000000000030
>>>>> [ 1993.535026] IP: [<ffffffffa067d129>] ddt_phys_decref+0x9/0x10 [zfs]
>>>>> [ 1993.535088] PGD b5e18067 PUD b5eeb067 PMD 0
>>>>> [ 1993.535104] Oops: 0002 [#1] SMP
>>>>> [ 1993.535116] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/
>>>>> host4/target4:0:0/4:0:0:0/block/sdd/queue/scheduler
>>>>> [ 1993.535133] CPU 0
>>>>> [ 1993.535140] Modules linked in: zfs(P) zcommon(P) znvpair(P) zavl(P)
>>>>> zunicode(P) spl fbcon tileblit font bitblit softcursor
>>>>> snd_hda_codec_realtek snd_hda_intel vga16fb vgastate snd_hda_codec
>>>>> snd_hwdep nouveau ppdev snd_pcm ttm drm_kms_helper lp parport_pc
>>>>> snd_timer parport drm snd soundcore snd_page_alloc i2c_algo_bit
>>>>> intel_agp zlib_deflate raid10 raid456 async_pq async_xor xor
>>>>> async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 ohci1394
>>>>> multipath 3w_9xxx ieee1394 pata_jmicron r8169 mii linear ahci
>>>>> [ 1993.535309] Pid: 29559, comm: z_fr_iss/0 Tainted: P
>>>>> 2.6.32-38-generic #83-Ubuntu P35-DS3P
>>>>> [ 1993.535323] RIP: 0010:[<ffffffffa067d129>]  [<ffffffffa067d129>]
>>>>> ddt_phys_decref+0x9/0x10 [zfs]
>>>>> [ 1993.535375] RSP: 0018:ffff88009af39dc0  EFLAGS: 00010246
>>>>> [ 1993.535384] RAX: 0000000000000000 RBX: ffff880099dd0000 RCX:
>>>>> ffffffff817b8ae0
>>>>> [ 1993.535396] RDX: 0000000000000004 RSI: ffff880099d28c70 RDI:
>>>>> 0000000000000000
>>>>> [ 1993.535407] RBP: ffff88009af39dc0 R08: 0000000000000000 R09:
>>>>> 964bfdd7b6bcfd8d
>>>>> [ 1993.535418] R10: 0000000000000001 R11: 0000000000000001 R12:
>>>>> ffff880099d28c70
>>>>> [ 1993.535429] R13: ffff8800b4cd8800 R14: 0000000000000000 R15:
>>>>> ffff88009a1aec50
>>>>> [ 1993.535442] FS:  0000000000000000(0000) GS:ffff880001c00000(0000)
>>>>> knlGS:0000000000000000
>>>>> [ 1993.535454] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>>>>> [ 1993.535464] CR2: 0000000000000030 CR3: 00000000b5e72000 CR4:
>>>>> 00000000000006f0
>>>>> [ 1993.535476] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>>> 0000000000000000
>>>>> [ 1993.535487] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
>>>>> 0000000000000400
>>>>> [ 1993.535498] Process z_fr_iss/0 (pid: 29559, threadinfo
>>>>> ffff88009af38000, task ffff88009daaade0)
>>>>> [ 1993.535511] Stack:
>>>>> [ 1993.535516]  ffff88009af39de0 ffffffffa0708cf1 ffff880099d28c10
>>>>> 0000000000000200
>>>>> [ 1993.535532] <0> ffff88009af39e10 ffffffffa070c339 ffff88009a1aec40
>>>>> ffff8800baf79600
>>>>> [ 1993.535552] <0> ffff880099d28fa0 ffff8800baf79628 ffff88009af39ee0
>>>>> ffffffffa05b5c84
>>>>> [ 1993.535574] Call Trace:
>>>>> [ 1993.535633]  [<ffffffffa0708cf1>] zio_ddt_free+0x51/0x70 [zfs]
>>>>> [ 1993.535691]  [<ffffffffa070c339>] zio_execute+0x99/0xf0 [zfs]
>>>>> [ 1993.535715]  [<ffffffffa05b5c84>] taskq_thread+0x224/0x5b0 [spl]
>>>>> [ 1993.535731]  [<ffffffff815452c9>] ? thread_return+0x48/0x41f
>>>>> [ 1993.535744]  [<ffffffff8105cd70>] ? default_wake_function+0x0/0x20
>>>>> [ 1993.535766]  [<ffffffffa05b5a60>] ? taskq_thread+0x0/0x5b0 [spl]
>>>>> [ 1993.535779]  [<ffffffff81084a06>] kthread+0x96/0xa0
>>>>> [ 1993.535791]  [<ffffffff810131ea>] child_rip+0xa/0x20
>>>>> [ 1993.535802]  [<ffffffff81084970>] ? kthread+0x0/0xa0
>>>>> [ 1993.535814]  [<ffffffff810131e0>] ? child_rip+0x0/0x20
>>>>> [ 1993.535822] Code: 75 04 48 8b 46 50 48 89 47 38 c9 c3 66 0f 1f 44
>>>>> 00 00 55 48 89 e5 0f 1f 44 00 00 48 83 47 30 01 c9 c3 55 48 89 e5 0f
>>>>> 1f 44 00 00 <48> 83 6f 30 01 c9 c3 55 48 89 e5 0f 1f 44 00 00 31 d2 48
>>>>> 8d 47
>>>>> [ 1993.535969] RIP  [<ffffffffa067d129>] ddt_phys_decref+0x9/0x10
>>>>> [zfs]
>>>>> [ 1993.536018]  RSP <ffff88009af39dc0>
>>>>> [ 1993.536636] CR2: 0000000000000030
>>>>> [ 1993.548418] [drm] nouveau 0000:01:00.0: Setting dpms mode 0 on vga
>>>>> encoder (output 0)
>>>>> [ 1993.554358] ---[ end trace 9dd386860ac68813 ]---
>>> 
>>>>> The beginning of this story is that I had a 500GB volume in the pool
>>>>> with deduplication. To revert this I wanted to destroy that
>>>>> filesystem, but the system always crashed during the process, so I
>>>>> booted the latest OpenIndiana system and tried to delete the
>>>>> filesystem from there. The process ran for days, but after a week or
>>>>> so the system rebooted (I guess it crashed too). After the reboot the
>>>>> OpenIndiana system could not import the pool, because it crashed
>>>>> during the import. When I tried to import the same pool under Ubuntu
>>>>> I got the error shown above.
>>>>> Is there a way to bring some life back into my pool?
>>> 
>>>>> Best regards,
>>>>> Mate
> 


