Re: moar weird metadata corruptions, this time on arm64

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Nov 23, 2022 at 12:00:55PM -0800, Darrick J. Wong wrote:
> On Tue, Nov 22, 2022 at 12:58:06PM +1100, Dave Chinner wrote:
> > On Mon, Nov 21, 2022 at 04:16:02PM -0800, Darrick J. Wong wrote:
> > > Hi all,
> > > 
> > > I've been running near-continuous integration testing of online fsck,
> > > and I've noticed that once a day, one of the ARM VMs will fail the test
> > > with out of order records in the data fork.
.....
> In the latest run, I got this trace data:
> 
> ino 0x600a754 nr 0x5 offset 0x304 nextoff 0x31e
> ino 0x600a754 func xfs_bmapi_reserve_delalloc line 4152 data fork:
>     ino 0x600a754 nr 0x0 nr_real 0x0 offset 0x7 blockcount 0xe startblock 0xc12867 state 0
>     ino 0x600a754 nr 0x1 nr_real 0x1 offset 0x78 blockcount 0xc8 startblock 0xc70d25 state 1
>     ino 0x600a754 nr 0x2 nr_real 0x2 offset 0x150 blockcount 0x16 startblock 0xc70dfd state 1
>     ino 0x600a754 nr 0x3 nr_real 0x3 offset 0x2a5 blockcount 0x5f startblock 0xc9f218 state 1
>     ino 0x600a754 nr 0x4 nr_real 0x4 offset 0x318 blockcount 0x6 startblock 0xffffffffe0007 state 0
>     ino 0x600a754 nr 0x5 nr_real 0x4 offset 0x304 blockcount 0x14 startblock 0xc9f277 state 0
>     ino 0x600a754 nr 0x6 nr_real 0x5 offset 0xaf8 blockcount 0x1a startblock 0xd17aa3 state 0
>     ino 0x600a754 nr 0x7 nr_real 0x6 offset 0x12f7 blockcount 0x40 startblock 0xcca511 state 1
>     ino 0x600a754 nr 0x8 nr_real 0x7 offset 0x307c blockcount 0x3 startblock 0xc70ded state 0
>     ino 0x600a754 nr 0x9 nr_real 0x8 offset 0x307f blockcount 0x1 startblock 0xc70df0 state 1
> 
> Here we again see that a delalloc extent was inserted into the wrong
> position in the iext leaf, same as last time.  The extra trace data I
> collected are as follows:
> 
> ino 0x600a754 fork 0 oldoff 0x318 oldlen 0x6 oldprealloc 0x0 isize 0x307e14c
>     ino 0x600a754 oldgotoff 0xaf8 oldgotstart 0xd17aa3 oldgotcount 0x1a oldgotstate 0
>     ino 0x600a754 freshgotoff 0xe0e46156d65cb freshgotstart 0xd178b5 freshgotcount 0x1a freshgotstate 0
>     ino 0x600a754 nowgotoff 0x318 nowgotstart 0xffffffffe0007 nowgotcount 0x6 nowgotstate 0
>     ino 0x600a754 oldicurpos 4 oldleafnr 9 oldleaf 0xfffffc012d8f4680
>     ino 0x600a754 newicurpos 4 newleafnr 10 newleaf 0xfffffc012304d800

.....

> Line 5 is a copy of @icur at the beginning fo the call, and line 6 is
> the contents of @icur after the xfs_bmap_add_extent_hole_delay call.
> Notice that the cursor positions are the same, but the leaf pointers are
> different.  I suspect that leaf ~d8f4680 has been freed, and this is the
> reason why freshgot is totally garbage. 

That implies that a extent tree modification is being made whilst
the delalloc function is holding the ILOCK_EXCL. Either rwsems on
ARM are broken (entirely possible given the subtle memory ordering
of the slow paths has caused this sort of problem on x64-64 multiple
times in the past), or something else isn't holding the ILOCK_EXCL
while modifying the iext tree.

> I wonder if the leaf pointers
> being different is the result of an iext btree splitting into 2 objects
> and then being recombined into one?

That implies multiple operations occurred - the single leaf won't
split until it is full - that's when we add the 15th record to the
tree. We're nowhere near that. And it won't attempt a rebalance that
may merge the the two leaf nodes until a remove occurs and the
number of entries in that leaf drops to half full (7 entries) and
the combined total is less that a full leaf.

Further, this is the left-most leaf, so any split will allocate a
new right sibling and move the entries right in to the new node. If
either the new right or the left node then rebalance, they will
always merge to the left-most leaf and free the right leaf. IOWs,
a grow-split-shrink-merge on the left most leaf in the tree will not
change the address of that left-most leaf - the right leave will get
allocated then freed...

So I'm not sure that even a grow-split-shrink-merge has occurred
concurrently here.....

Oh.... inserting into the root block of the tree results in
xfs_iext_realloc_root() being called, and krealloc() is called to
grow the root leaf block. that points cur->leaf at the newly
reallocated chunk of memory.

Ok, so the change in cursor is to be expected. There's nothing wrong
with the cursor, or that we have a reallocated the root leaf.

That leaves something not holding the right lock when inserting
a new extent,, or rwsems are broken on ARM.

> I augmented the xfs_iext_* functions to check the ILOCK state in all
> functions that are passed an xfs_inode.  None of them tripped across the
> entire fstests cloud run last night, so there's no obvious problem
> there.  The buffered write path takes ILOCK_EXCL and keeps it right up
> to where the debug splat happens, so there's no locking problem there.
> 
> So I started looking for things that could shift the extent count by one.
> Looking for semi-adjacent records, I noticed this:
> 
> nr 0x1 nr_real 0x1 offset 0x78 blockcount 0xc8 startblock 0xc70d25 state 1
> nr 0x2 nr_real 0x2 offset 0x150 blockcount 0x16 startblock 0xc70dfd state 1
> 
> 0xc70d25+0xc8 == 0xc70ded, so this could be the result of someone
> punching 0x10 blocks.
> 
> ino 0x600a754 nr 0x3 nr_real 0x3 offset 0x2a5 blockcount 0x5f startblock 0xc9f218 state 1
> ino 0x600a754 nr 0x5 nr_real 0x4 offset 0x304 blockcount 0x14 startblock 0xc9f277 state 0
> 
> The incorrect delalloc reservation notwithstanding, these two records
> are logically and physically adjacent, with the only difference being
> that one is unwritten and the other is not.  Someone could have
> converted an unwritten extent to written, possibly as a result of a
> post-write conversion?
> 
> ino 0x600a754 nr 0x8 nr_real 0x7 offset 0x307c blockcount 0x3 startblock 0xc70ded state 0
> ino 0x600a754 nr 0x9 nr_real 0x8 offset 0x307f blockcount 0x1 startblock 0xc70df0 state 1
> 
> The 0xc70ded here is familiar -- I wonder if this got mapped here as a
> result of an FIEXCHANGE'd with offset 0x140?  Extent 9 is adjacent with
> extent 8, except for the state difference.  Hmm.  I guess I better go
> check the FIEXCHANGE code...

And there's no such thing a FIEXCHANGE in the upstream current code
base, so that would be a good place to look for shenanigans.... :)

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux