Re: Regular FS shutdown while rsync is running

On Wed, Jan 23, 2019 at 07:11:53AM -0500, Brian Foster wrote:
> On Wed, Jan 23, 2019 at 12:14:17PM +0100, Lucas Stach wrote:
> > Am Dienstag, den 22.01.2019, 08:02 -0500 schrieb Brian Foster:
> > > On Tue, Jan 22, 2019 at 11:39:53AM +0100, Lucas Stach wrote:
> > > > Hi Brian,
> > > > 
> > > > Am Montag, den 21.01.2019, 13:11 -0500 schrieb Brian Foster:
> > > > [...]
> > > > > > So for the moment, here's the output of the above sequence.
> > > > > > 
> > > > > > xfs_db> convert agno 5 agbno 7831662 fsb
> > > > > > 0x5077806e (1350008942)
> > > > > > xfs_db> fsb 0x5077806e
> > > > > > xfs_db> type finobt
> > > > > > xfs_db> print
> > > > > > magic = 0x49414233
> > > > > > level = 1
> > > > > > numrecs = 335
> > > > > > leftsib = 7810856
> > > > > > rightsib = null
> > > > > > bno = 7387612016
> > > > > > lsn = 0x6671003d9700
> > > > > > uuid = 026711cc-25c7-44b9-89aa-0aac496edfec
> > > > > > owner = 5
> > > > > > crc = 0xe12b19b2 (correct)
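As an aside, the convert step above is just the AG encoding; a minimal
sketch of the arithmetic, assuming sb_agblklog is 28 on this filesystem
(which is what the values above imply):

#include <stdio.h>
#include <stdint.h>

/*
 * fsb = (agno << sb_agblklog) | agbno, as in XFS_AGB_TO_FSB().
 * agblklog = 28 is an assumption consistent with the xfs_db output.
 */
int main(void)
{
	uint64_t agno = 5, agbno = 7831662, agblklog = 28;
	uint64_t fsb = (agno << agblklog) | agbno;

	/* prints: fsb = 0x5077806e (1350008942) */
	printf("fsb = 0x%llx (%llu)\n",
	       (unsigned long long)fsb, (unsigned long long)fsb);
	return 0;
}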
> > > > > 
> > > > > As expected, we have the inobt magic. Interesting that this is a fairly
> > > > > full intermediate (level > 0) node. There is no right sibling, which
> > > > > means we're at the far right end of the tree. I wouldn't mind poking
> > > > > around a bit more at the tree, but that might be easier with access to
> > > > > the metadump. I also think that xfs_repair would have complained were
> > > > > something more significant wrong with the tree.
> > > > > 
> > > > > Hmm, I wonder if the (lightly tested) diff below would help us catch
> > > > > anything. It basically just splits up the currently combined inobt and
> > > > > finobt I/O verifiers to expect the appropriate magic number (rather than
> > > > > accepting either magic for both trees). Could you give that a try?
> > > > > Unless we're doing something like using the wrong type of cursor for a
> > > > > particular tree, I'd think this would catch wherever we happen to put a
> > > > > bad magic on disk. Note that this assumes the underlying filesystem has
> > > > > been repaired, so that we can catch the next time an on-disk
> > > > > corruption is introduced.
> > > > > 
> > > > > You'll also need to turn up the XFS error level to make sure this prints
> > > > > out a stack trace if/when a verifier failure triggers:
> > > > > 
> > > > > echo 5 > /proc/sys/fs/xfs/error_level
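For reference, the idea of the diff as a minimal standalone sketch (not
the actual kernel patch): each tree type gets its own magic check
instead of a single check that accepts either value.

#include <stdbool.h>
#include <stdint.h>

#define XFS_IBT_CRC_MAGIC	0x49414233u	/* "IAB3", v5 inobt */
#define XFS_FIBT_CRC_MAGIC	0x46494233u	/* "FIB3", v5 finobt */

/* combined check (current behavior): either magic passes for both trees */
static bool inobt_or_finobt_magic_ok(uint32_t magic)
{
	return magic == XFS_IBT_CRC_MAGIC || magic == XFS_FIBT_CRC_MAGIC;
}

/* split checks (the diff's approach): each verifier expects one magic */
static bool inobt_magic_ok(uint32_t magic)
{
	return magic == XFS_IBT_CRC_MAGIC;
}

static bool finobt_magic_ok(uint32_t magic)
{
	return magic == XFS_FIBT_CRC_MAGIC;
}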
> > > > > 
> > > > > I guess we also shouldn't rule out hardware issues or whatnot. I did
> > > > > notice you have a strange kernel version: 4.19.4-holodeck10. Is that a
> > > > > distro kernel? Has it been modified from upstream in any way? If so, I'd
> > > > > strongly suggest trying to confirm whether this is reproducible with an
> > > > > upstream kernel.
> > > > 
> > > > With the finobt verifier changes applied we are unable to mount the FS,
> > > > even after running xfs_repair.
> > > > 
> > > > xfs_repair had found "bad magic # 0x49414233 in inobt block 5/2631703",
> > > > which would be daddr 0x1b5db40b8 according to xfs_db. The mount trips
> > > > over a buffer at a different daddr though:
> > > > 
> > > 
> > > So the mount failed, you ran repair and discovered the bad magic..? That
> > > suggests there was still an issue with the fs on-disk. Could you run
> > > 'xfs_repair -n' after the actual xfs_repair to confirm the fs is free of
> > > errors before it is mounted? Note that xfs_repair unconditionally
> > > regenerates certain metadata structures (like the finobt) from scratch
> > > so there is always the possibility that xfs_repair itself is introducing
> > > some problem in the fs.
> > 
> > The sequence was:
> > 1. Trying to mount with the debug kernel, which didn't work due to the
> > finobt verifier rejecting a bad btree node.
> > 2. Full run of mutating xfs_repair (which did find a finobt node with
> > bad magic)
> > 3. Try to mount again, still fails due to finobt verifier rejecting a
> > bad btree node, but at a different daddr than what was flagged by
> > xfs_repair.
> > 4. xfs_repair -n finds a bad magic finobt node at the same position as
> > the mount splat.
> > 
> > This leads me to wonder if xfs_repair was the poison rather than the
> > cure in this case. After digging into the xfs_repair code, I found the
> > following, which looks suspicious, but I don't know enough about XFS
> > internals to tell if this can actually happen:
> > 
> 
> That does imply xfs_repair is the culprit. If you haven't, could you try
> the xfs_repair -> xfs_repair -n sequence without a mount attempt in the
> picture just to be sure? That may also explain why the block seems to
> shift around. It wouldn't surprise me if it's the same relative block in
> > the tree, but IIRC the tree's on-disk placement can shift from run to run.
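Something like the following, with /dev/sdX as a placeholder for the
(unmounted) device:

# full repair, then an immediate read-only re-check, no mount in between
xfs_repair /dev/sdX
xfs_repair -n /dev/sdX    # should come out clean if repair really fixed it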
> 
> > build_ino_tree(), which is shared between XFS_BTNUM_INO and
> > XFS_BTNUM_FINO btree rebuilds, contains the following in one of the
> > loops:
> > 
> > if (lptr->num_recs_pb > 0)
> > 	prop_ino_cursor(mp, agno, btree_curs,
> > 			ino_rec->ino_startnum, 0);
> > 
> > prop_ino_cursor() calls libxfs_btree_init_block() with the btnum
> > parameter being a fixed XFS_BTNUM_INO.
> > 
> 
> It's been a while since I've looked at the guts of this code but that
> definitely does not look right. build_ino_tree() is where we abstract
> construction of the different tree types. Nice catch!
> 
> If this is the problem, I'll have to take a closer look to figure out
> why this problem isn't more prevalent..
> 

Hmm... so it looks like this is the code responsible for _growing_ the
interior nodes of the tree (i.e., level > 0). If I'm following the code
correctly, the high-level algorithm is to first set up a block for each
level of the tree. This occurs early in build_ino_tree() and uses the
correct btnum. We then start populating the leaf nodes with the finobt
records that have previously been built up in-core. For each leaf block,
we populate the higher level block with the key based on the first
record in the leaf, then fill in the remaining records in the leaf. This
repeats until the block at the next level up is full, at which point we
grab a new level > 0 block and initialize it with the inobt magic.

I haven't confirmed, but I think this basically means that we'd
initialize every intermediate block after the left-most block at each
level in the tree with the inobt magic. I think Dave already recognized
in the other subthread that you have a two-block level 1 tree. I suspect
the only reasons we haven't run into this until now are that the finobt
is intended to remain fairly small (it only tracks free inode records,
and allocation is prioritized from those records, which removes them
from the tree), and that the problem is limited to users who repair a
filesystem in such a state.
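
To illustrate, a toy model (made-up block capacities, not xfs_repair
code) of how every level-1 block after the left-most one ends up stamped
with the inobt magic:

#include <stdio.h>
#include <stdint.h>

#define XFS_IBT_CRC_MAGIC	0x49414233u	/* "IAB3" */
#define XFS_FIBT_CRC_MAGIC	0x46494233u	/* "FIB3" */

int main(void)
{
	/* enough level-1 keys to need two blocks, as in this report */
	int nkeys = 500, per_block = 335;

	for (int i = 0, blk = 0; i < nkeys; i += per_block, blk++) {
		/*
		 * build_ino_tree() stamps the first block at each level
		 * with the caller's btnum; the grow path in
		 * prop_ino_cursor() hardcodes XFS_BTNUM_INO.
		 */
		uint32_t magic = (blk == 0) ? XFS_FIBT_CRC_MAGIC
					    : XFS_IBT_CRC_MAGIC;
		printf("finobt level-1 block %d: magic 0x%x%s\n", blk,
		       magic, magic == XFS_IBT_CRC_MAGIC ? " (wrong!)" : "");
	}
	return 0;
}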

I'm guessing you have a large enough set of inodes and a sustained
enough deletion pattern in your workload that you eventually free enough
inodes across a large enough set of records to populate a multi-block
level 1 tree. Then, as soon as you try to allocate from that side of the
tree, the finer-grained checks in __xfs_btree_check_[l|s]block() discover
the wrong magic and generate the error.
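
Roughly what happens there, modeled outside the kernel: the write
verifier accepted either magic, but the cursor-based block check
compares against the single magic implied by the cursor's tree type, so
a finobt cursor landing on an "IAB3" block fails.

#include <stdbool.h>
#include <stdint.h>

enum btnum { BTNUM_INO, BTNUM_FINO };	/* simplified cursor types */

static const uint32_t expected_magic[] = {
	[BTNUM_INO]  = 0x49414233u,	/* "IAB3" */
	[BTNUM_FINO] = 0x46494233u,	/* "FIB3" */
};

/* a BTNUM_FINO cursor reading a block stamped "IAB3" fails this check */
static bool block_magic_ok(enum btnum cursor, uint32_t disk_magic)
{
	return disk_magic == expected_magic[cursor];
}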

The positive here is that this isn't really a coherency problem with the
filesystem. The finobt itself is coherent and consistent with the inobt,
and the magic value doesn't dictate how the trees are used/updated at
runtime. This means the only real corruption should be the magic value
itself.

I am wondering whether this warrants a special case in the magic
checking to convert the runtime internal error into more of a warning to
unmount and run xfs_repair (similar to how we detected the agfl size
mismatch problem). Dave, any thoughts on that?

Brian

> > So if it's possible to get into this code path while rebuilding the
> > finobt, xfs_repair will generate an otherwise valid btree node with the
> > wrong magic, which matches what we see.
> > 
> > Sadly we have no more debug information available about the state of
> > the FS on the first crash, which led us to run xfs_repair in the first
> > place.
> > 
> 
> Naturally, the next question would have been what required repair in the
> first place. :P
> 
> Anyways, if you wanted to give this a quick try to confirm, it looks like
> a fairly straightforward change. See the appended xfsprogs patch. Note
> again that you can (and should) run this against the restored metadump
> that captures the corruption before running against the primary fs.
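
For completeness, the usual shape of that workflow, with image and
metadump file names as placeholders:

xfs_mdrestore corruption.metadump /tmp/fs.img   # restore to a sparse image
xfs_repair /tmp/fs.img                          # patched xfs_repair
xfs_repair -n /tmp/fs.img                       # verify it now comes out clean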
> 
> Brian
> 
> --- 8< ---
> 
> diff --git a/repair/phase5.c b/repair/phase5.c
> index 85d1f4fb..05333f11 100644
> --- a/repair/phase5.c
> +++ b/repair/phase5.c
> @@ -983,7 +983,7 @@ init_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
>  
>  static void
>  prop_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
> -	xfs_agino_t startino, int level)
> +	xfs_btnum_t btnum, xfs_agino_t startino, int level)
>  {
>  	struct xfs_btree_block	*bt_hdr;
>  	xfs_inobt_key_t		*bt_key;
> @@ -1005,7 +1005,7 @@ prop_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
>  		 * first path up the left side of the tree
>  		 * where the agbno's are already set up
>  		 */
> -		prop_ino_cursor(mp, agno, btree_curs, startino, level);
> +		prop_ino_cursor(mp, agno, btree_curs, btnum, startino, level);
>  	}
>  
>  	if (be16_to_cpu(bt_hdr->bb_numrecs) ==
> @@ -1041,7 +1041,7 @@ prop_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
>  		lptr->buf_p->b_ops = &xfs_inobt_buf_ops;
>  		bt_hdr = XFS_BUF_TO_BLOCK(lptr->buf_p);
>  		memset(bt_hdr, 0, mp->m_sb.sb_blocksize);
> -		libxfs_btree_init_block(mp, lptr->buf_p, XFS_BTNUM_INO,
> +		libxfs_btree_init_block(mp, lptr->buf_p, btnum,
>  					level, 0, agno, 0);
>  
>  		bt_hdr->bb_u.s.bb_leftsib = cpu_to_be32(lptr->prev_agbno);
> @@ -1049,7 +1049,7 @@ prop_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
>  		/*
>  		 * propagate extent record for first extent in new block up
>  		 */
> -		prop_ino_cursor(mp, agno, btree_curs, startino, level);
> +		prop_ino_cursor(mp, agno, btree_curs, btnum, startino, level);
>  	}
>  	/*
>  	 * add inode info to current block
> @@ -1201,7 +1201,7 @@ build_ino_tree(xfs_mount_t *mp, xfs_agnumber_t agno,
>  			lptr->modulo--;
>  
>  		if (lptr->num_recs_pb > 0)
> -			prop_ino_cursor(mp, agno, btree_curs,
> +			prop_ino_cursor(mp, agno, btree_curs, btnum,
>  					ino_rec->ino_startnum, 0);
>  
>  		bt_rec = (xfs_inobt_rec_t *)


