[adding Tejun Heo, because I'm partly blaming his workqueues]

On Mon, Sep 20, 2010 at 9:13 PM, Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> On Fri, Sep 17, 2010 at 10:52:27AM +1000, Dave Chinner wrote:
>> Christoph, this implies an inode that has been marked for reclaim
>> that has not passed through xfs_fs_evict_inode() after being
>> initialised. If it went through the eviction process, the iolock
>> would have been re-initialised to a different context. Can you think
>> of any path that can get here without going through ->evict? I can't
>> off the top of my head...
>
> I think this could happen if the init_inode_always during
> re-initialization of an inode in reclaim fails in iget. I have a patch
> to add that I'll run through xfsqa. It should only happen very rarely.

I had the same lockdep report in mainline 2.6.36-rc5, when I enabled
lockdep to try to debug a hang under high load and some memory pressure.
Do you have a patch I could try to get rid of this lockdep report, so I
can see if there is another one lurking behind it?

As for the hang: today I wanted to install updates on my Gentoo system,
and that included an update of KOffice. Because I allow emerge (Gentoo's
package manager) to process 4 packages in parallel and specify -j5 for
each of the compiles, emerge decided to build 4 KOffice programs in
parallel, which resulted in a load of ~18 on my 4-core system. With all
these gcc processes running and the compiles happening on a tmpfs, this
also resulted in some memory pressure (but not in a swap storm).

The first time I got this hang I was not able to capture useful
information; the second time it happened I got a report from the hung
task timeout. Three processes were blocked in:

[ 1203.056532] [<ffffffff810801c0>] ? sync_page+0x0/0x50
[ 1203.061836] [<ffffffff8156078d>] ? io_schedule+0x3d/0x60
[ 1203.067390] [<ffffffff810801fd>] ? sync_page+0x3d/0x50

probably because of these:

[ 1202.872424] INFO: task plasma-desktop:3169 blocked for more than 120 seconds.
[ 1202.879883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1202.888118] ffff88011ef860b0 0000000000000086 000000011ef860b0 ffff880114faff48
[ 1202.895936] ffff8800071cf1c0 0000000000012780 ffff880114faffd8 0000000000012780
[ 1202.903777] ffff880114faffd8 ffff88011ef86310 ffff880114faffd8 ffff88011ef86308
[ 1202.911586] Call Trace:
[ 1202.914140] [<ffffffff815612c1>] ? __mutex_lock_slowpath+0xe1/0x160
[ 1202.920693] [<ffffffff810d1e57>] ? __d_lookup+0x97/0x120
[ 1202.926272] [<ffffffff81560d8a>] ? mutex_lock+0x1a/0x40
[ 1202.931783] [<ffffffff810c9dbf>] ? do_lookup+0x10f/0x190
[ 1202.937426] [<ffffffff810c7e24>] ? acl_permission_check+0x54/0xb0
[ 1202.943898] [<ffffffff810ca8d9>] ? link_path_walk+0x4c9/0x9a0
[ 1202.950028] [<ffffffff810d65f4>] ? mnt_want_write+0x34/0x70
[ 1202.955947] [<ffffffff810caecf>] ? path_walk+0x5f/0xe0
[ 1202.961416] [<ffffffff810caf9b>] ? do_path_lookup+0x4b/0x50
[ 1202.967325] [<ffffffff810cbc35>] ? user_path_at+0x55/0xb0
[ 1202.973020] [<ffffffff8102e860>] ? __dequeue_entity+0x40/0x50
[ 1202.979054] [<ffffffff8107bcf4>] ? perf_event_task_sched_out+0x44/0x260
[ 1202.985995] [<ffffffff810c2e46>] ? vfs_fstatat+0x36/0x80
[ 1202.991600] [<ffffffff815602ac>] ? schedule+0x24c/0x6f0
[ 1202.997157] [<ffffffff810c2f8f>] ? sys_newstat+0x1f/0x50
[ 1203.002784] [<ffffffff81003035>] ? device_not_available+0x15/0x20
[ 1203.009204] [<ffffffff8100246b>] ? system_call_fastpath+0x16/0x1b

[ 1202.400046] INFO: task kworker/u:8:845 blocked for more than 120 seconds.
[ 1202.407128] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1202.415372] ffff88007ffa8b60 0000000000000046 ffff88007f74fe98 0000000000000000
[ 1202.423229] ffff8800071cf770 0000000000012780 ffff8800070e9fd8 0000000000012780
[ 1202.431054] ffff8800070e9fd8 ffff88007ffa8dc0 ffff8800070e9fd8 ffff88007ffa8db8
[ 1202.438898] Call Trace:
[ 1202.441495] [<ffffffff814361ed>] ? md_write_start+0x9d/0x190
[ 1202.447536] [<ffffffff81052e70>] ? autoremove_wake_function+0x0/0x30
[ 1202.454311] [<ffffffff810925ca>] ? zone_nr_free_pages+0x9a/0xb0
[ 1202.460639] [<ffffffff8142c423>] ? make_request+0x23/0x800
[ 1202.466496] [<ffffffff811d90ea>] ? blkcipher_walk_done+0x8a/0x230
[ 1202.472990] [<ffffffff811df4c4>] ? crypto_cbc_encrypt+0xe4/0x180
[ 1202.479327] [<ffffffff8102a150>] ? twofish_encrypt+0x0/0x10
[ 1202.485276] [<ffffffff81431ac3>] ? md_make_request+0xc3/0x220
[ 1202.491412] [<ffffffff811eafca>] ? generic_make_request+0x18a/0x330
[ 1202.498096] [<ffffffff814484fc>] ? crypt_convert+0x25c/0x310
[ 1202.504090] [<ffffffff81448881>] ? kcryptd_crypt+0x2d1/0x3f0
[ 1202.510031] [<ffffffff814485b0>] ? kcryptd_crypt+0x0/0x3f0
[ 1202.516081] [<ffffffff8104d2eb>] ? process_one_work+0xfb/0x370
[ 1202.522123] [<ffffffff8104ee4c>] ? worker_thread+0x16c/0x360
[ 1202.527935] [<ffffffff8104ece0>] ? worker_thread+0x0/0x360
[ 1202.533541] [<ffffffff8104ece0>] ? worker_thread+0x0/0x360
[ 1202.539131] [<ffffffff810529e6>] ? kthread+0x96/0xa0
[ 1202.544245] [<ffffffff81003194>] ? kernel_thread_helper+0x4/0x10
[ 1202.550363] [<ffffffff81052950>] ? kthread+0x0/0xa0
[ 1202.555346] [<ffffffff81003190>] ? kernel_thread_helper+0x0/0x10

After that I enabled lockdep and retried the same update: this time no
hang, only the same lockdep report that Yang Ruirui had.

So I'm currently at a loss as to how I should continue from here. The
XFS false positive drowns out any other lockdep problems, but XFS had
some hang problems with the new workqueues.
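If it happens again, I can try to capture more state on demand instead of
waiting for the hung task detector. A rough plan (assuming the kernel is
built with CONFIG_MAGIC_SYSRQ; the timeout value 30 is just an example):

```shell
# Allow all sysrq functions at runtime (requires root):
echo 1 > /proc/sys/kernel/sysrq

# Dump backtraces of all uninterruptible (D-state) tasks to the kernel log,
# immediately, without waiting for the 120s hung task timeout:
echo w > /proc/sysrq-trigger

# Optionally lower the hung task timeout so the detector fires sooner:
echo 30 > /proc/sys/kernel/hung_task_timeout_secs
```

The resulting traces show up in dmesg, same as the reports above.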
(Personally, I had no other hangs with earlier 2.6.36-rcs.)

As seen in the call trace from the kworker, there are more workqueues
involved; that's why I added Tejun to the CC. My root filesystem is XFS
on dm-crypt on an md/raid1, so it might have been something in there.

If you need more information, just ask; I will try to provide it.

Torsten
_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs