Re: repeatable attribute corruption

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 16 Mar 2012 17:15:04 +1100

On Mon, Mar 12, 2012 at 10:00:15PM +0000, Roger Willcocks wrote:
> Hi folks,
> 
> On stock CentOS kernels from 5.5 (earliest I tried) through to the
> latest 6.2 kernel (as of five days ago) 2.6.32-220.7.1.el6.x86_64 I can
> repeatedly create a silent on-disk corruption. This typically shows up
> later as:
> 
> XFS internal error xfs_da_do_buf(1) at line 2020 of
> file /usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/fs/xfs/xfs_da_btree.c
> 
> (or line 2112 of the same file).
> 
> The 'before' metadump is a bit big to attach to an email (~600k) so
> here's a download link valid for 30 days -
> 
> http://private.filmlight.ltd.uk/c4c864ecca4ac13b/xfs_attrib_crash.taz
> 
> - this is a gzip-compressed tar file containing 
> 
> -rw-r--r--.  1 root root   10885632 Mar 12 18:02 xfs_metadump_hda6
> -rw-r--r--.  1 root root       3558 Mar 12 18:15 zap.txt
> 
> The metadump expands to about 5GB. xfs_repair believes it to be clean.
> I've not obfuscated the dump; it was originally a copy of the linux
> header directory from the 5.5 kernel.
> 
> 'zap.txt' simply overwrites a single existing extended (directory)
> attribute with a slightly longer value. So, steps to repeat:
> 
> # xfs_mdrestore xfs_metadump_hda6 xfs.img 
> # mkdir /mnt/disk1
> # mount -o loop xfs.img /mnt/disk1
> # setfattr --restore=zap.txt
> # umount /mnt/disk1
> # xfs_repair -n xfs.img
> ...
> bad sibling back pointer for block 4 in attribute fork for inode 131
> problem with attribute contents in inode 131
> would clear attr fork
> bad nblocks 8 for inode 131, would reset to 3
> bad anextents 4 for inode 131, would reset to 0

OK, I think I have a handle on this now. It is a nasty corner case,
and I'll try to explain it (later most of this will end up in the
commit message).

What we have is the initial condition of a node format attribute
btree with two leaves at index 1 and 2. Call them L1 and L2.  The
leaf L1 is completely full, there is not a single byte of free space
in it. L2 is mostly empty.  The attribute being replaced - call it X
- is the last attribute in L1.

The way an attribute replace is executed is that the replacement
attribute - call it Y - is first inserted into the tree, but has an
INCOMPLETE flag set on it so that list traversals ignore it. Once
this transaction is committed, a second transaction it run to
atomically mark Y as COMPLETE and X as INCOMPLETE, so that a
traversal will now find Y and skip X. Once that transaction is
committed, attribute X is then removed.

So, the initial condition is:

     +--------+     +--------+
     |   L1   |     |   L2   |
     | fwd: 2 |---->| fwd: 0 |
     | bwd: 0 |<----| bwd: 1 |
     | fsp: 0 |     | fsp: N |
     |--------|     |--------|
     | attr A |     | attr 1 |
     |--------|     |--------|
     | attr B |     | attr 2 |
     |--------|     |--------|
     ..........     ..........
     |--------|     |--------|
     | attr X |     | attr n |
     +--------+     +--------+

So now we go to replace X, and see that L1:fsp = 0 - it is full so
we can't insert Y in the same leaf. So we record the the location of
attribute X so we can track it for later use, then we split L1 into
L1 and L3 and reblance across the two leafs. We end with:

     +--------+     +--------+     +--------+
     |   L1   |     |   L3   |     |   L2   |
     | fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
     | bwd: 0 |<----| bwd: 1 |<----| bwd: 3 |
     | fsp: M |     | fsp: J |     | fsp: N |
     |--------|     |--------|     |--------|
     | attr A |     | attr X |     | attr 1 |
     |--------|     +--------+     |--------|
     | attr B |                    | attr 2 |
     |--------|                    |--------|
     ..........                    ..........
     |--------|                    |--------|
     | attr W |                    | attr n |
     +--------+                    +--------+

And we track that the original attribute is now at L3:0.

We then try to insert Y into L1 again, and find that there isn't
enough room because the new attribute is larger than the old one.
Hence we have to split again to make room for Y. We end up with
this:

     +--------+     +--------+     +--------+     +--------+
     |   L1   |     |   L4   |     |   L3   |     |   L2   |
     | fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
     | bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 |
     | fsp: M |     | fsp: J |     | fsp: J |     | fsp: N |
     |--------|     |--------|     |--------|     |--------|
     | attr A |     | attr Y |     | attr X |     | attr 1 |
     |--------|     + INCOMP +     +--------+     |--------|
     | attr B |     +--------+                    | attr 2 |
     |--------|                                   |--------|
     ..........                                   ..........
     |--------|                                   |--------|
     | attr W |                                   | attr n |
     +--------+                                   +--------+

And now we have the new (incomplete) attribute @ L4:0, and the
original attribute at L3:0. At this point, the first transaction is
committed, and we move to the flipping of the flags.

This is where we are supposed to end up with this:

     +--------+     +--------+     +--------+     +--------+
     |   L1   |     |   L4   |     |   L3   |     |   L2   |
     | fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
     | bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 |
     | fsp: M |     | fsp: J |     | fsp: J |     | fsp: N |
     |--------|     |--------|     |--------|     |--------|
     | attr A |     | attr Y |     | attr X |     | attr 1 |
     |--------|     +--------+     + INCOMP +     |--------|
     | attr B |                    +--------+     | attr 2 |
     |--------|                                   |--------|
     ..........                                   ..........
     |--------|                                   |--------|
     | attr W |                                   | attr n |
     +--------+                                   +--------+

But that doesn't happen properly - the attribute tracking indexes
are not pointing to the right locations, and so we modify random
attributes. We then try to remove attribute X, which then involves
leaf joins, and things then go pear-shaped because the in-memory
state is corrupt, resulting in the breakage we see on disk.

This double split on insertion is indeed a rare corner case becase
growing attributes is not a common operation, and it's even less
common that the attribute being grown is the last attribute in a
full leaf.

The root of the problem is the handling of this corner case - the
initial rebalance moves the entry that is being replaced to a new
block, and the tracking code gets this wrong - it no longer points
to the attribute that will be replaced. That's where it starts going
wrong, but the code that does this tracking is exceedinly twisty and
I don't yet fully understand all the cases it handles. Hence I can
fix this specific problem, but I'm currently causing regressions in
other cases.

I think the first thing I have to do is clean up the tracking code
to use descriptive variables names because it is all to easy to
confuse what "index/blkno" and "index2/blkno2" are actually holding
and that makes it much harder to understand the code.

But, that can wait: it is Beer O'Clock on a hot friday afternoon, so
I'm going take the dog for a walk and then go down to the pub....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs