On Thu, Nov 24, 2022 at 03:40:23PM +1100, Dave Chinner wrote: > On Wed, Nov 23, 2022 at 12:00:55PM -0800, Darrick J. Wong wrote: > > On Tue, Nov 22, 2022 at 12:58:06PM +1100, Dave Chinner wrote: > > > On Mon, Nov 21, 2022 at 04:16:02PM -0800, Darrick J. Wong wrote: > > > > Hi all, > > > > > > > > I've been running near-continuous integration testing of online fsck, > > > > and I've noticed that once a day, one of the ARM VMs will fail the test > > > > with out of order records in the data fork. > ..... > > In the latest run, I got this trace data: > > > > ino 0x600a754 nr 0x5 offset 0x304 nextoff 0x31e > > ino 0x600a754 func xfs_bmapi_reserve_delalloc line 4152 data fork: > > ino 0x600a754 nr 0x0 nr_real 0x0 offset 0x7 blockcount 0xe startblock 0xc12867 state 0 > > ino 0x600a754 nr 0x1 nr_real 0x1 offset 0x78 blockcount 0xc8 startblock 0xc70d25 state 1 > > ino 0x600a754 nr 0x2 nr_real 0x2 offset 0x150 blockcount 0x16 startblock 0xc70dfd state 1 > > ino 0x600a754 nr 0x3 nr_real 0x3 offset 0x2a5 blockcount 0x5f startblock 0xc9f218 state 1 > > ino 0x600a754 nr 0x4 nr_real 0x4 offset 0x318 blockcount 0x6 startblock 0xffffffffe0007 state 0 > > ino 0x600a754 nr 0x5 nr_real 0x4 offset 0x304 blockcount 0x14 startblock 0xc9f277 state 0 > > ino 0x600a754 nr 0x6 nr_real 0x5 offset 0xaf8 blockcount 0x1a startblock 0xd17aa3 state 0 > > ino 0x600a754 nr 0x7 nr_real 0x6 offset 0x12f7 blockcount 0x40 startblock 0xcca511 state 1 > > ino 0x600a754 nr 0x8 nr_real 0x7 offset 0x307c blockcount 0x3 startblock 0xc70ded state 0 > > ino 0x600a754 nr 0x9 nr_real 0x8 offset 0x307f blockcount 0x1 startblock 0xc70df0 state 1 > > > > Here we again see that a delalloc extent was inserted into the wrong > > position in the iext leaf, same as last time. The extra trace data I > > collected are as follows: > > > > ino 0x600a754 fork 0 oldoff 0x318 oldlen 0x6 oldprealloc 0x0 isize 0x307e14c > > ino 0x600a754 oldgotoff 0xaf8 oldgotstart 0xd17aa3 oldgotcount 0x1a oldgotstate 0 > > ino 0x600a754 freshgotoff 0xe0e46156d65cb freshgotstart 0xd178b5 freshgotcount 0x1a freshgotstate 0 > > ino 0x600a754 nowgotoff 0x318 nowgotstart 0xffffffffe0007 nowgotcount 0x6 nowgotstate 0 > > ino 0x600a754 oldicurpos 4 oldleafnr 9 oldleaf 0xfffffc012d8f4680 > > ino 0x600a754 newicurpos 4 newleafnr 10 newleaf 0xfffffc012304d800 > > ..... > > > Line 5 is a copy of @icur at the beginning fo the call, and line 6 is > > the contents of @icur after the xfs_bmap_add_extent_hole_delay call. > > Notice that the cursor positions are the same, but the leaf pointers are > > different. I suspect that leaf ~d8f4680 has been freed, and this is the > > reason why freshgot is totally garbage. > > That implies that a extent tree modification is being made whilst > the delalloc function is holding the ILOCK_EXCL. Either rwsems on > ARM are broken (entirely possible given the subtle memory ordering > of the slow paths has caused this sort of problem on x64-64 multiple > times in the past), or something else isn't holding the ILOCK_EXCL > while modifying the iext tree. Hmm, maybe I'll try turning on lockdep for arm64 and see if it has anythign interesting (ha!) to say. Well, it /is/ a long holiday weekend.... > > I wonder if the leaf pointers > > being different is the result of an iext btree splitting into 2 objects > > and then being recombined into one? > > That implies multiple operations occurred - the single leaf won't > split until it is full - that's when we add the 15th record to the > tree. We're nowhere near that. And it won't attempt a rebalance that > may merge the the two leaf nodes until a remove occurs and the > number of entries in that leaf drops to half full (7 entries) and > the combined total is less that a full leaf. > > Further, this is the left-most leaf, so any split will allocate a > new right sibling and move the entries right in to the new node. If > either the new right or the left node then rebalance, they will > always merge to the left-most leaf and free the right leaf. IOWs, > a grow-split-shrink-merge on the left most leaf in the tree will not > change the address of that left-most leaf - the right leave will get > allocated then freed... > > So I'm not sure that even a grow-split-shrink-merge has occurred > concurrently here..... > > Oh.... inserting into the root block of the tree results in > xfs_iext_realloc_root() being called, and krealloc() is called to > grow the root leaf block. that points cur->leaf at the newly > reallocated chunk of memory. > > Ok, so the change in cursor is to be expected. There's nothing wrong > with the cursor, or that we have a reallocated the root leaf. Ah, ok. > That leaves something not holding the right lock when inserting > a new extent,, or rwsems are broken on ARM. That reminded me that a month or two back I also saw this crash when alwasycow is turned on and we start runnng generic/522: stack segment: 0000 [#1] PREEMPT SMP CPU: 0 PID: 430863 Comm: fsstress Not tainted 6.0.0-xfsx #6.0.0 ae8d0391a7a281b411e9d54eb9b1c5c85ef7dbc0 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014 RIP: 0010:xfs_iext_prev+0x71/0x150 [xfs] Code: 41 89 5c 24 08 74 33 85 db 41 8b 55 14 78 38 83 fa 01 b8 0f 00 00 00 74 4c 39 c3 7d 1d 48 63 f3 48 83 fe 0f 0f 87 c4 00 00 00 <48> 83 7d 00 00 74 09 5b 5d 41 5c 41 5d 4 d 10 85 db RSP: 0018:ffffc9000207fab8 EFLAGS: 00010297 RAX: 000000000000000f RBX: 000000000000000e RCX: 000000000000000c RDX: 0000000000000002 RSI: 000000000000000e RDI: ffff88815598d6c0 RBP: 000229e31f0000e9 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000001000 R12: ffffc9000207fb10 R13: ffff88815598d6c0 R14: 000229e31f000001 R15: 000ffffffffe0000 FS: 00007f46c03c4740(0000) GS:ffff88842d000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f46c03c3000 CR3: 000000014b16c000 CR4: 00000000003506f0 Call Trace: <TASK> xfs_iomap_prealloc_size.constprop.0.isra.0+0x1a6/0x410 [xfs 71c030f9aed06cc72c1a12695c507ccebedb1b15] ? __filemap_add_folio+0x1d9/0x590 ? xfs_ilock+0xb5/0x1f0 [xfs 71c030f9aed06cc72c1a12695c507ccebedb1b15] xfs_buffered_write_iomap_begin+0xa66/0xbc0 [xfs 71c030f9aed06cc72c1a12695c507ccebedb1b15] ? __filemap_get_folio+0x15c/0x330 ? xfs_buffered_write_iomap_end+0x63/0x190 [xfs 71c030f9aed06cc72c1a12695c507ccebedb1b15] iomap_iter+0x122/0x2c0 iomap_file_buffered_write+0x92/0x360 xfs_file_buffered_write+0xb1/0x330 [xfs 71c030f9aed06cc72c1a12695c507ccebedb1b15] vfs_write+0x2eb/0x410 ksys_write+0x65/0xe0 do_syscall_64+0x2b/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7f46c04dba37 Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 RSP: 002b:00007fffd7ea9858 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 000000000001192f RCX: 00007f46c04dba37 RDX: 000000000001192f RSI: 0000558ce0a4f600 RDI: 0000000000000004 RBP: 0000000000000004 R08: 0000000000000000 R09: 0000558ce0a4f600 R10: 00007f46c05e1430 R11: 0000000000000246 R12: 00000000000000e8 R13: 000000000005c0fe R14: 0000558ce0a4f600 R15: 0000000000000000 So I wonder if this is the same kind of thing -- someone kreallocates the iext leaf block without taking ILOCK, and now our poor thread here walks off the dead leaf pointer in the cursor. Or maybe this is a totally separate problem, who knows. I've only seen it twice in the last 6 months. > > I augmented the xfs_iext_* functions to check the ILOCK state in all > > functions that are passed an xfs_inode. None of them tripped across the > > entire fstests cloud run last night, so there's no obvious problem > > there. The buffered write path takes ILOCK_EXCL and keeps it right up > > to where the debug splat happens, so there's no locking problem there. > > > > So I started looking for things that could shift the extent count by one. > > Looking for semi-adjacent records, I noticed this: > > > > nr 0x1 nr_real 0x1 offset 0x78 blockcount 0xc8 startblock 0xc70d25 state 1 > > nr 0x2 nr_real 0x2 offset 0x150 blockcount 0x16 startblock 0xc70dfd state 1 > > > > 0xc70d25+0xc8 == 0xc70ded, so this could be the result of someone > > punching 0x10 blocks. > > > > ino 0x600a754 nr 0x3 nr_real 0x3 offset 0x2a5 blockcount 0x5f startblock 0xc9f218 state 1 > > ino 0x600a754 nr 0x5 nr_real 0x4 offset 0x304 blockcount 0x14 startblock 0xc9f277 state 0 > > > > The incorrect delalloc reservation notwithstanding, these two records > > are logically and physically adjacent, with the only difference being > > that one is unwritten and the other is not. Someone could have > > converted an unwritten extent to written, possibly as a result of a > > post-write conversion? > > > > ino 0x600a754 nr 0x8 nr_real 0x7 offset 0x307c blockcount 0x3 startblock 0xc70ded state 0 > > ino 0x600a754 nr 0x9 nr_real 0x8 offset 0x307f blockcount 0x1 startblock 0xc70df0 state 1 > > > > The 0xc70ded here is familiar -- I wonder if this got mapped here as a > > result of an FIEXCHANGE'd with offset 0x140? Extent 9 is adjacent with > > extent 8, except for the state difference. Hmm. I guess I better go > > check the FIEXCHANGE code... > > And there's no such thing a FIEXCHANGE in the upstream current code > base, so that would be a good place to look for shenanigans.... :) Hm. I looked over the code for that, and it takes ILOCK_EXCL on both inodes involved in the file contents exchange, and holds it past the final transaction commit + defer chain processing. So I don't think it's there. --D > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx