Re: 3.2.9 and locking problem


On Monday 12 of March 2012, Dave Chinner wrote:
> On Fri, Mar 09, 2012 at 08:28:47PM +0100, Arkadiusz Miśkiewicz wrote:
> > Are there any bugs in area visible in tracebacks below? I have a system
> > where one operation (upgrade of single rpm package) causes rpm process
> > to hang in D-state, sysrq-w below:
> > 
> > [  400.755253] SysRq : Show Blocked State
> > [  400.758507]   task                        PC stack   pid father
> > [  400.758507] rpm             D 0000000100005781     0  8732   8698 0x00000000
> > [  400.758507]  ffff88021657dc48 0000000000000086 ffff880200000000 ffff88025126f480
> > [  400.758507]  ffff880252276630 ffff88021657dfd8 ffff88021657dfd8 ffff88021657dfd8
> > [  400.758507]  ffff880252074af0 ffff880252276630 ffff88024cb0d005 ffff88021657dcb0
> > [  400.758507] Call Trace:
> > [  400.758507]  [<ffffffff8114b22a>] ? kmem_cache_free+0x2a/0x110
> > [  400.758507]  [<ffffffff8114d2ed>] ? kmem_cache_alloc+0x11d/0x140
> > [  400.758507]  [<ffffffffa00df3c7>] ? kmem_zone_alloc+0x67/0xe0 [xfs]
> > [  400.758507]  [<ffffffff8148b78a>] schedule+0x3a/0x50
> > [  400.758507]  [<ffffffff8148d25d>] rwsem_down_failed_common+0xbd/0x150
> > [  400.758507]  [<ffffffff8148d303>] rwsem_down_write_failed+0x13/0x20
> > [  400.758507]  [<ffffffff812652a3>] call_rwsem_down_write_failed+0x13/0x20
> > [  400.758507]  [<ffffffff8148c8ed>] ? down_write+0x2d/0x40
> > [  400.758507]  [<ffffffffa00cf97c>] xfs_ilock+0xcc/0x120 [xfs]
> > [  400.758507]  [<ffffffffa00d4ace>] xfs_setattr_nonsize+0x1ce/0x5b0 [xfs]
> > [  400.758507]  [<ffffffff81265502>] ? __strncpy_from_user+0x22/0x60
> > [  400.758507]  [<ffffffffa00d52ab>] xfs_vn_setattr+0x1b/0x40 [xfs]
> > [  400.758507]  [<ffffffff8117c1a2>] notify_change+0x1a2/0x340
> > [  400.758507]  [<ffffffff8115ed80>] chown_common+0xd0/0xf0
> > [  400.758507]  [<ffffffff8115fe4c>] sys_chown+0xac/0x1a0
> > [  400.758507]  [<ffffffff81495112>] system_call_fastpath+0x16/0x1b
> 
> I can't see why we'd get a task stuck here - it's waiting on the
> XFS_ILOCK_EXCL. The only reason for this is if we leaked an unlock
> somewhere. It appears you can reproduce this fairly quickly, 

The Linux-VServer patch [1] seems to be messing with locking. It would be nice if you 
could take a quick look at it and see whether it could be the guilty party here.
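To make sure I understand what to look for in that patch: a leak like the one you 
describe would just be any path that takes the inode lock and bails out without 
dropping it, something like the made-up sketch below? (do_something() is only a 
placeholder, not code from the patch.)

	/* made-up sketch, not code from either patch */
	int example_leaky_path(struct xfs_inode *ip)
	{
		int error;

		xfs_ilock(ip, XFS_ILOCK_EXCL);

		error = do_something(ip);	/* placeholder for the patched code */
		if (error)
			return error;		/* bug: returns with XFS_ILOCK_EXCL still held */

		xfs_iunlock(ip, XFS_ILOCK_EXCL);
		return 0;
	}

If something like that is hiding in the patch, the next chown on the same inode 
would block in xfs_ilock() exactly like the trace above, right?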

On the other hand, I wasn't able to reproduce this on 3.0.22. The vserver patch for 
3.0.22 [2] does the same thing as the vserver patch for 3.2.9.

> so
> running an event trace via trace-cmd for all the xfs_ilock trace
> points and posting the report output might tell us what inode is
> blocked and where we leaked (if that is the cause).

I will try to get more information, but it will take some time (most likely weeks) 
before I can take this machine down for debugging.
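For reference, I assume something like this is what you have in mind (tracepoint 
names guessed from the xfs: subsystem, so correct me if the glob is wrong):

  trace-cmd record -e 'xfs:xfs_ilock*' -e 'xfs:xfs_iunlock'
  ... reproduce the hang ...
  trace-cmd report

and then I'll post the report output here.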

> Cheers,
> Dave.

1. http://vserver.13thfloor.at/Experimental/patch-3.2.9-vs2.3.2.7.diff
2. http://vserver.13thfloor.at/Experimental/patch-3.0.22-vs2.3.2.3.diff
-- 
Arkadiusz Miśkiewicz        PLD/Linux Team
arekm / maven.pl            http://ftp.pld-linux.org/
