On Tue, May 07, 2013 at 03:24:28PM -0500, Mark Tinguely wrote: > On 05/07/13 15:22, Dave Jones wrote: > >On Tue, May 07, 2013 at 03:04:33PM -0500, Mark Tinguely wrote: > > > On 05/07/13 14:59, Dave Jones wrote: > > > > On Tue, May 07, 2013 at 02:58:15PM -0500, Mark Tinguely wrote: > > > > > > > > > > I can hit this almost instantly with fsx. I'll do a bisect, though > > > > > > it sounds like you already have a suspect. > > > > > > > > > > > > > > > > If you want to try kmem debug of Linux 3.8 that would help. > > > > > > > > I'm not sure what that is. > > > > > > Sorry, if you would test Linux 3.8 with "CONFIG_DEBUG_SLAB=y". > > > >Ah, done that. (I pretty much always run with it). > > > >This is something new. Even 3.9 was fine. It's only since > >the recent xfs merge. > > > > Dave > > > > git revert 666d644cd72a9ec58b353209ff191d7430f3b357 That won't prevent the use after free. That commit fixed a problem that could lead to a use after free, but what we are seeing here is that it has ultimately exposed a previously unknown issue that causes the use after free. Basically what is happening is that there are two commits for the EFD being processed, when there should only be one. I'm not sure how this is happening yet, but these three traces came out from my debug sequentially when running generic/006: __xfs_efi_release 0xffff88006c4d7058 2 Pid: 4756, comm: kworker/1:1H Tainted: G B 3.9.0-rc8-dgc+ #592 Call Trace: [<ffffffff814e43a5>] __xfs_efi_release+0x35/0x80 [<ffffffff814e444d>] xfs_efi_item_unpin+0x5d/0x70 [<ffffffff814dbb88>] xfs_trans_committed_bulk+0x248/0x2e0 [<ffffffff814e18db>] xlog_cil_committed+0x3b/0x140 [<ffffffff814ddfae>] xlog_state_do_callback+0x19e/0x3a0 ..... That's the EFI being unpinned and it's reference going away during transaction commit completion procesing. __xfs_efi_release 0xffff88006c4d7058 1 Pid: 4756, comm: kworker/1:1H Tainted: G B 3.9.0-rc8-dgc+ #592 Call Trace: [<ffffffff814e43a5>] __xfs_efi_release+0x35/0x80 [<ffffffff814e4620>] xfs_efi_release+0x60/0x80 [<ffffffff814e46a6>] xfs_efd_item_committed+0x26/0x40 [<ffffffff814dba21>] xfs_trans_committed_bulk+0xe1/0x2e0 [<ffffffff814e18db>] xlog_cil_committed+0x3b/0x140 [<ffffffff814ddfae>] xlog_state_do_callback+0x19e/0x3a0 ..... That's the EFD releasing the EFI, which drops it's reference count and frees the both the EFI and the EFD. __xfs_efi_release 0xffff88006c4d7058 1802201963 Pid: 4756, comm: kworker/1:1H Tainted: G B 3.9.0-rc8-dgc+ #592 Call Trace: [<ffffffff814e43a5>] __xfs_efi_release+0x35/0x80 [<ffffffff814e4630>] xfs_efi_release+0x70/0x80 [<ffffffff814e46a6>] xfs_efd_item_committed+0x26/0x40 [<ffffffff814dba21>] xfs_trans_committed_bulk+0xe1/0x2e0 [<ffffffff814e18db>] xlog_cil_committed+0x3b/0x140 ..... And there's the EFI being released by an EFD after it has been freed. It's definitely been freed because 1802201963 = 0x6b6b6b6b.... Like I said, I don't understand how this is happening yet, but just seeing this behaviour explains an awful lot about why we've had long term issues with this code.... FWIW, I used to run with slab debugging turned on for all my debug builds. I switched my .config from SLAB to SLUB a few months ago, and didn't realise this silently turned of memory poisoning. That's how it slipped through my testing.... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs