On Wed, Nov 23, 2022 at 10:46:13PM -0800, Darrick J. Wong wrote: > On Thu, Nov 24, 2022 at 03:40:23PM +1100, Dave Chinner wrote: > > On Wed, Nov 23, 2022 at 12:00:55PM -0800, Darrick J. Wong wrote: > > > On Tue, Nov 22, 2022 at 12:58:06PM +1100, Dave Chinner wrote: > > > > On Mon, Nov 21, 2022 at 04:16:02PM -0800, Darrick J. Wong wrote: > > > > > Hi all, > > > > > > > > > > I've been running near-continuous integration testing of online fsck, > > > > > and I've noticed that once a day, one of the ARM VMs will fail the test > > > > > with out of order records in the data fork. > > ..... > > > In the latest run, I got this trace data: > > > > > > ino 0x600a754 nr 0x5 offset 0x304 nextoff 0x31e > > > ino 0x600a754 func xfs_bmapi_reserve_delalloc line 4152 data fork: > > > ino 0x600a754 nr 0x0 nr_real 0x0 offset 0x7 blockcount 0xe startblock 0xc12867 state 0 > > > ino 0x600a754 nr 0x1 nr_real 0x1 offset 0x78 blockcount 0xc8 startblock 0xc70d25 state 1 > > > ino 0x600a754 nr 0x2 nr_real 0x2 offset 0x150 blockcount 0x16 startblock 0xc70dfd state 1 > > > ino 0x600a754 nr 0x3 nr_real 0x3 offset 0x2a5 blockcount 0x5f startblock 0xc9f218 state 1 > > > ino 0x600a754 nr 0x4 nr_real 0x4 offset 0x318 blockcount 0x6 startblock 0xffffffffe0007 state 0 > > > ino 0x600a754 nr 0x5 nr_real 0x4 offset 0x304 blockcount 0x14 startblock 0xc9f277 state 0 > > > ino 0x600a754 nr 0x6 nr_real 0x5 offset 0xaf8 blockcount 0x1a startblock 0xd17aa3 state 0 > > > ino 0x600a754 nr 0x7 nr_real 0x6 offset 0x12f7 blockcount 0x40 startblock 0xcca511 state 1 > > > ino 0x600a754 nr 0x8 nr_real 0x7 offset 0x307c blockcount 0x3 startblock 0xc70ded state 0 > > > ino 0x600a754 nr 0x9 nr_real 0x8 offset 0x307f blockcount 0x1 startblock 0xc70df0 state 1 > > > > > > Here we again see that a delalloc extent was inserted into the wrong > > > position in the iext leaf, same as last time. The extra trace data I > > > collected are as follows: > > > > > > ino 0x600a754 fork 0 oldoff 0x318 oldlen 0x6 oldprealloc 0x0 isize 0x307e14c > > > ino 0x600a754 oldgotoff 0xaf8 oldgotstart 0xd17aa3 oldgotcount 0x1a oldgotstate 0 > > > ino 0x600a754 freshgotoff 0xe0e46156d65cb freshgotstart 0xd178b5 freshgotcount 0x1a freshgotstate 0 > > > ino 0x600a754 nowgotoff 0x318 nowgotstart 0xffffffffe0007 nowgotcount 0x6 nowgotstate 0 > > > ino 0x600a754 oldicurpos 4 oldleafnr 9 oldleaf 0xfffffc012d8f4680 > > > ino 0x600a754 newicurpos 4 newleafnr 10 newleaf 0xfffffc012304d800 > > > > ..... > > > > > Line 5 is a copy of @icur at the beginning fo the call, and line 6 is > > > the contents of @icur after the xfs_bmap_add_extent_hole_delay call. > > > Notice that the cursor positions are the same, but the leaf pointers are > > > different. I suspect that leaf ~d8f4680 has been freed, and this is the > > > reason why freshgot is totally garbage. > > > > That implies that a extent tree modification is being made whilst > > the delalloc function is holding the ILOCK_EXCL. Either rwsems on > > ARM are broken (entirely possible given the subtle memory ordering > > of the slow paths has caused this sort of problem on x64-64 multiple > > times in the past), or something else isn't holding the ILOCK_EXCL > > while modifying the iext tree. > > Hmm, maybe I'll try turning on lockdep for arm64 and see if it has > anythign interesting (ha!) to say. Well, it /is/ a long holiday > weekend.... > > > > I wonder if the leaf pointers > > > being different is the result of an iext btree splitting into 2 objects > > > and then being recombined into one? > > > > That implies multiple operations occurred - the single leaf won't > > split until it is full - that's when we add the 15th record to the > > tree. We're nowhere near that. And it won't attempt a rebalance that > > may merge the the two leaf nodes until a remove occurs and the > > number of entries in that leaf drops to half full (7 entries) and > > the combined total is less that a full leaf. > > > > Further, this is the left-most leaf, so any split will allocate a > > new right sibling and move the entries right in to the new node. If > > either the new right or the left node then rebalance, they will > > always merge to the left-most leaf and free the right leaf. IOWs, > > a grow-split-shrink-merge on the left most leaf in the tree will not > > change the address of that left-most leaf - the right leave will get > > allocated then freed... > > > > So I'm not sure that even a grow-split-shrink-merge has occurred > > concurrently here..... > > > > Oh.... inserting into the root block of the tree results in > > xfs_iext_realloc_root() being called, and krealloc() is called to > > grow the root leaf block. that points cur->leaf at the newly > > reallocated chunk of memory. > > > > Ok, so the change in cursor is to be expected. There's nothing wrong > > with the cursor, or that we have a reallocated the root leaf. > > Ah, ok. > > > That leaves something not holding the right lock when inserting > > a new extent,, or rwsems are broken on ARM. > > That reminded me that a month or two back I also saw this crash when > alwasycow is turned on and we start runnng generic/522: > > stack segment: 0000 [#1] PREEMPT SMP > CPU: 0 PID: 430863 Comm: fsstress Not tainted 6.0.0-xfsx #6.0.0 ae8d0391a7a281b411e9d54eb9b1c5c85ef7dbc0 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014 > RIP: 0010:xfs_iext_prev+0x71/0x150 [xfs] > Code: 41 89 5c 24 08 74 33 85 db 41 8b 55 14 78 38 83 fa 01 b8 0f 00 00 00 74 4c 39 c3 7d 1d 48 63 f3 48 83 fe 0f 0f 87 c4 00 00 00 <48> 83 7d 00 00 74 09 5b 5d 41 5c 41 5d 4 > d 10 85 db > RSP: 0018:ffffc9000207fab8 EFLAGS: 00010297 > RAX: 000000000000000f RBX: 000000000000000e RCX: 000000000000000c > RDX: 0000000000000002 RSI: 000000000000000e RDI: ffff88815598d6c0 > RBP: 000229e31f0000e9 R08: 0000000000000001 R09: 0000000000000000 > R10: 0000000000000000 R11: 0000000000001000 R12: ffffc9000207fb10 > R13: ffff88815598d6c0 R14: 000229e31f000001 R15: 000ffffffffe0000 > FS: 00007f46c03c4740(0000) GS:ffff88842d000000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f46c03c3000 CR3: 000000014b16c000 CR4: 00000000003506f0 > Call Trace: > <TASK> > xfs_iomap_prealloc_size.constprop.0.isra.0+0x1a6/0x410 [xfs 71c030f9aed06cc72c1a12695c507ccebedb1b15] > ? __filemap_add_folio+0x1d9/0x590 > ? xfs_ilock+0xb5/0x1f0 [xfs 71c030f9aed06cc72c1a12695c507ccebedb1b15] > xfs_buffered_write_iomap_begin+0xa66/0xbc0 [xfs 71c030f9aed06cc72c1a12695c507ccebedb1b15] > ? __filemap_get_folio+0x15c/0x330 > ? xfs_buffered_write_iomap_end+0x63/0x190 [xfs 71c030f9aed06cc72c1a12695c507ccebedb1b15] > iomap_iter+0x122/0x2c0 > iomap_file_buffered_write+0x92/0x360 > xfs_file_buffered_write+0xb1/0x330 [xfs 71c030f9aed06cc72c1a12695c507ccebedb1b15] > vfs_write+0x2eb/0x410 > ksys_write+0x65/0xe0 > do_syscall_64+0x2b/0x80 > entry_SYSCALL_64_after_hwframe+0x46/0xb0 > RIP: 0033:0x7f46c04dba37 > Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 > RSP: 002b:00007fffd7ea9858 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 > RAX: ffffffffffffffda RBX: 000000000001192f RCX: 00007f46c04dba37 > RDX: 000000000001192f RSI: 0000558ce0a4f600 RDI: 0000000000000004 > RBP: 0000000000000004 R08: 0000000000000000 R09: 0000558ce0a4f600 > R10: 00007f46c05e1430 R11: 0000000000000246 R12: 00000000000000e8 > R13: 000000000005c0fe R14: 0000558ce0a4f600 R15: 0000000000000000 > > So I wonder if this is the same kind of thing -- someone kreallocates > the iext leaf block without taking ILOCK, and now our poor thread here > walks off the dead leaf pointer in the cursor. > > Or maybe this is a totally separate problem, who knows. I've only seen > it twice in the last 6 months. > > > > I augmented the xfs_iext_* functions to check the ILOCK state in all > > > functions that are passed an xfs_inode. None of them tripped across the > > > entire fstests cloud run last night, so there's no obvious problem > > > there. The buffered write path takes ILOCK_EXCL and keeps it right up > > > to where the debug splat happens, so there's no locking problem there. > > > > > > So I started looking for things that could shift the extent count by one. > > > Looking for semi-adjacent records, I noticed this: > > > > > > nr 0x1 nr_real 0x1 offset 0x78 blockcount 0xc8 startblock 0xc70d25 state 1 > > > nr 0x2 nr_real 0x2 offset 0x150 blockcount 0x16 startblock 0xc70dfd state 1 > > > > > > 0xc70d25+0xc8 == 0xc70ded, so this could be the result of someone > > > punching 0x10 blocks. > > > > > > ino 0x600a754 nr 0x3 nr_real 0x3 offset 0x2a5 blockcount 0x5f startblock 0xc9f218 state 1 > > > ino 0x600a754 nr 0x5 nr_real 0x4 offset 0x304 blockcount 0x14 startblock 0xc9f277 state 0 > > > > > > The incorrect delalloc reservation notwithstanding, these two records > > > are logically and physically adjacent, with the only difference being > > > that one is unwritten and the other is not. Someone could have > > > converted an unwritten extent to written, possibly as a result of a > > > post-write conversion? > > > > > > ino 0x600a754 nr 0x8 nr_real 0x7 offset 0x307c blockcount 0x3 startblock 0xc70ded state 0 > > > ino 0x600a754 nr 0x9 nr_real 0x8 offset 0x307f blockcount 0x1 startblock 0xc70df0 state 1 > > > > > > The 0xc70ded here is familiar -- I wonder if this got mapped here as a > > > result of an FIEXCHANGE'd with offset 0x140? Extent 9 is adjacent with > > > extent 8, except for the state difference. Hmm. I guess I better go > > > check the FIEXCHANGE code... > > > > And there's no such thing a FIEXCHANGE in the upstream current code > > base, so that would be a good place to look for shenanigans.... :) > > Hm. I looked over the code for that, and it takes ILOCK_EXCL on both > inodes involved in the file contents exchange, and holds it past the > final transaction commit + defer chain processing. So I don't think > it's there. Also, last night's run produced this: ino 0x140bb3 func xfs_bmapi_reserve_delalloc line 4164 data fork: ino 0x140bb3 nr 0x0 nr_real 0x0 offset 0xb9 blockcount 0x1f startblock 0x935de2 state 1 ino 0x140bb3 nr 0x1 nr_real 0x1 offset 0xe6 blockcount 0xa startblock 0xffffffffe0007 state 0 ino 0x140bb3 nr 0x2 nr_real 0x1 offset 0xd8 blockcount 0xe startblock 0x935e01 state 0 ino 0x140bb3 fork 0 oldoff 0xe6 oldlen 0x4 oldprealloc 0x6 isize 0xe6000 ino 0x140bb3 oldgotoff 0xea oldgotstart 0xfffffffffffffffe oldgotcount 0x0 oldgotstate 0 ino 0x140bb3 crapgotoff 0x0 crapgotstart 0x0 crapgotcount 0x0 crapgotstate 0 ino 0x140bb3 freshgotoff 0xd8 freshgotstart 0x935e01 freshgotcount 0xe freshgotstate 0 ino 0x140bb3 nowgotoff 0xe6 nowgotstart 0xffffffffe0007 nowgotcount 0xa nowgotstate 0 ino 0x140bb3 oldicurpos 1 oldleafnr 2 oldleaf 0xfffffc00f0609a00 ino 0x140bb3 crapicurpos 2 crapleafnr 2 crapleaf 0xfffffc00f0609a00 ino 0x140bb3 freshicurpos 1 freshleafnr 2 freshleaf 0xfffffc00f0609a00 ino 0x140bb3 newicurpos 1 newleafnr 3 newleaf 0xfffffc00f0609a00 The old/fresh/nowgot have the same meaning as yesterday. "crapgot" is the results of duplicating the cursor at the start of the body of xfs_bmapi_reserve_delalloc and performing a fresh lookup at @off. I think @oldgot is a HOLESTARTBLOCK extent because the first lookup didn't find anything, so we filled in imap with "fake hole until the end". At the time of the first lookup, I suspect that there's only one 32-block unwritten extent in the mapping (hence oldicurpos==1) but by the time we get to recording crapgot, crapicurpos==2. IOWS, I think I can safely eliminate FIEXCHANGE shenanigans and concentrate on finding an unlocked unwritten -> written extent conversion. Or possibly a written -> unwritten extent conversion? Anyway, long holiday weekend, so I won't get back to this until Monday. Just wanted to persist my notes to the mailing list so I can move on to testing the write race fixes with djwong-dev. --D > --D > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@xxxxxxxxxxxxx