On Wed, Jan 23, 2019 at 08:03:07AM -0500, Brian Foster wrote:
> On Wed, Jan 23, 2019 at 07:11:53AM -0500, Brian Foster wrote:
> > On Wed, Jan 23, 2019 at 12:14:17PM +0100, Lucas Stach wrote:
> > > On Tuesday, 2019-01-22 at 08:02 -0500, Brian Foster wrote:
> > > > On Tue, Jan 22, 2019 at 11:39:53AM +0100, Lucas Stach wrote:
> > > > > Hi Brian,
> > > > >
> > > > > On Monday, 2019-01-21 at 13:11 -0500, Brian Foster wrote:
> > > > > [...]
> > > > > > > So for the moment, here's the output of the above sequence.
> > > > > > >
> > > > > > > xfs_db> convert agno 5 agbno 7831662 fsb
> > > > > > > 0x5077806e (1350008942)
> > > > > > > xfs_db> fsb 0x5077806e
> > > > > > > xfs_db> type finobt
> > > > > > > xfs_db> print
> > > > > > > magic = 0x49414233
> > > > > > > level = 1
> > > > > > > numrecs = 335
> > > > > > > leftsib = 7810856
> > > > > > > rightsib = null
> > > > > > > bno = 7387612016
> > > > > > > lsn = 0x6671003d9700
> > > > > > > uuid = 026711cc-25c7-44b9-89aa-0aac496edfec
> > > > > > > owner = 5
> > > > > > > crc = 0xe12b19b2 (correct)
> > > > > >
> > > > > > As expected, we have the inobt magic. Interesting that this is a
> > > > > > fairly full intermediate (level > 0) node. There is no right
> > > > > > sibling, which means we're at the far right end of the tree. I
> > > > > > wouldn't mind poking around a bit more at the tree, but that
> > > > > > might be easier with access to the metadump. I also think that
> > > > > > xfs_repair would have complained were something more significant
> > > > > > wrong with the tree.
> > > > > >
> > > > > > Hmm, I wonder if the (lightly tested) diff below would help us
> > > > > > catch anything. It basically just splits up the currently
> > > > > > combined inobt and finobt I/O verifiers to expect the appropriate
> > > > > > magic number (rather than accepting either magic for both trees).
> > > > > > Could you give that a try? Unless we're doing something like
> > > > > > using the wrong type of cursor for a particular tree, I'd think
> > > > > > this would catch wherever we happen to put a bad magic on disk.
> > > > > > Note that this assumes the underlying filesystem has been
> > > > > > repaired, so as to detect the next time an on-disk corruption is
> > > > > > introduced.
> > > > > >
> > > > > > You'll also need to turn up the XFS error level to make sure this
> > > > > > prints out a stack trace if/when a verifier failure triggers:
> > > > > >
> > > > > > echo 5 > /proc/sys/fs/xfs/error_level
> > > > > >
> > > > > > I guess we also shouldn't rule out hardware issues or whatnot. I
> > > > > > did notice you have a strange kernel version: 4.19.4-holodeck10.
> > > > > > Is that a distro kernel? Has it been modified from upstream in
> > > > > > any way? If so, I'd strongly suggest trying to confirm whether
> > > > > > this is reproducible with an upstream kernel.
> > > > >
> > > > > With the finobt verifier changes applied we are unable to mount
> > > > > the FS, even after running xfs_repair.
> > > > >
> > > > > xfs_repair had found "bad magic # 0x49414233 in inobt block
> > > > > 5/2631703", which would be daddr 0x1b5db40b8 according to xfs_db.
> > > > > The mount trips over a buffer at a different daddr though:
> > > >
> > > > So the mount failed, you ran repair and discovered the bad magic..?
> > > > That suggests there was still an issue with the fs on-disk. Could
> > > > you run 'xfs_repair -n' after the actual xfs_repair to confirm the
> > > > fs is free of errors before it is mounted? Note that xfs_repair
> > > > unconditionally regenerates certain metadata structures (like the
> > > > finobt) from scratch, so there is always the possibility that
> > > > xfs_repair itself is introducing some problem in the fs.
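A quick note for anyone reproducing the debug setup above: 0x49414233
is ASCII "IAB3", i.e. XFS_IBT_CRC_MAGIC, whereas a v5 finobt block
should carry XFS_FIBT_CRC_MAGIC ("FIB3"). The verifier split I
suggested further up is roughly the following shape. This is an
untested sketch rather than the actual diff, and
xfs_finobt_verify_struct() is a made-up name standing in for the
structural checks the two verifiers share:

/*
 * Untested sketch: accept only the finobt magics here, instead of
 * accepting either the inobt or the finobt magic for both trees.
 */
static xfs_failaddr_t
xfs_finobt_verify(
	struct xfs_buf		*bp)
{
	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);

	switch (block->bb_magic) {
	case cpu_to_be32(XFS_FIBT_CRC_MAGIC):	/* v5: "FIB3" */
	case cpu_to_be32(XFS_FIBT_MAGIC):	/* v4: "FIBT" */
		break;
	default:
		/* an inobt magic such as "IAB3" now fails right here */
		return __this_address;
	}

	/* made-up helper for the checks shared with the inobt verifier */
	return xfs_finobt_verify_struct(bp);
}

The inobt counterpart would accept only XFS_IBT_CRC_MAGIC/XFS_IBT_MAGIC,
so a block stamped with the other tree's magic is rejected at I/O time
instead of at the finer grained runtime cursor checks.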
> > > The sequence was:
> > > 1. Trying to mount with the debug kernel, which didn't work due to
> > >    the finobt verifier rejecting a bad btree node.
> > > 2. A full run of mutating xfs_repair (which did find a finobt node
> > >    with bad magic).
> > > 3. Trying to mount again, which still fails with the finobt verifier
> > >    rejecting a bad btree node, but at a different daddr than what
> > >    was flagged by xfs_repair.
> > > 4. xfs_repair -n finds a bad magic finobt node at the same position
> > >    as the mount splat.
> > >
> > > This leads me to wonder if xfs_repair was the poison rather than the
> > > cure in this case. After digging into the xfs_repair code I found
> > > the following, which looks suspicious, but I don't know enough about
> > > XFS internals to tell if this can actually happen:
> >
> > That does imply xfs_repair is the culprit. If you haven't, could you
> > try the xfs_repair -> xfs_repair -n sequence without a mount attempt
> > in the picture, just to be sure? That may also explain why the block
> > seems to shift around. It wouldn't surprise me if it's the same
> > relative block in the tree, but IIRC the tree itself can be shifted
> > around from run to run.
> >
> > > build_ino_tree(), which is shared between XFS_BTNUM_INO and
> > > XFS_BTNUM_FINO btree rebuilds, contains the following in one of the
> > > loops:
> > >
> > >         if (lptr->num_recs_pb > 0)
> > >                 prop_ino_cursor(mp, agno, btree_curs,
> > >                                 ino_rec->ino_startnum, 0);
> > >
> > > prop_ino_cursor() calls libxfs_btree_init_block() with the btnum
> > > parameter being a fixed XFS_BTNUM_INO.
> >
> > It's been a while since I've looked at the guts of this code, but
> > that definitely does not look right. build_ino_tree() is where we
> > abstract construction of the different tree types. Nice catch!
> >
> > If this is the problem, I'll have to take a closer look to figure out
> > why this problem isn't more prevalent..
>
> Hmm... so it looks like this is the code responsible for _growing_ the
> interior nodes of the tree (i.e., level > 0). If I'm following the
> code correctly, the high level algorithm is to first set up a block
> for each level of the tree. This occurs early in build_ino_tree() and
> uses the correct btnum. We then start populating the leaf nodes with
> the finobt records that have previously been built up in-core. For
> each leaf block, we populate the higher level block with the key based
> on the first record in the leaf, then fill in the remaining records in
> the leaf. This repeats until the block at the next level up is full,
> at which point we grab a new level > 0 block and initialize it with
> the inobt magic.
>
> I haven't confirmed, but I think this basically means that we'd
> initialize every intermediate block after the left-most block at each
> level in the tree with the inobt magic. I think Dave already
> recognized in the other subthread that you have a 2 block level 1
> tree. I suspect the only reasons we haven't run into this until now
> are that the finobt is intended to remain fairly small (it only tracks
> free inode records and allocation is prioritized from those records,
> removing them from the tree), and that this is limited to users who
> repair a filesystem in such a state.
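To make the failure mechanism explicit: libxfs_btree_init_block()
derives the on-disk magic from its btnum argument, so a hardcoded
XFS_BTNUM_INO stamps the inobt magic onto what is otherwise a perfectly
good finobt block. Paraphrased (this is not the literal libxfs code,
just the relevant mapping):

/*
 * Paraphrase of the btnum -> magic selection that
 * libxfs_btree_init_block() performs internally.
 */
static uint32_t
btnum_to_magic(xfs_btnum_t btnum, bool crc)
{
	switch (btnum) {
	case XFS_BTNUM_INO:
		return crc ? XFS_IBT_CRC_MAGIC : XFS_IBT_MAGIC;
	case XFS_BTNUM_FINO:
		return crc ? XFS_FIBT_CRC_MAGIC : XFS_FIBT_MAGIC;
	default:
		/* other btree types elided from this sketch */
		return 0;
	}
}

With btnum fixed at XFS_BTNUM_INO, every interior finobt block grown
through prop_ino_cursor() ends up with "IAB3" instead of "FIB3", which
is exactly the mismatch the verifier trips over.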
I created a large finobt filesystem, confirmed that xfs_repair stamps
every node as noted above with the inobt magic, and verified that the
patch below fixes the problem. I'll post it to the list shortly. Thanks
again.

Brian

> I'm guessing you have a large enough set of inodes and a sustained
> enough deletion pattern in your workload that you eventually free
> enough inodes across a large enough set of records to populate a
> multi-block level 1 tree. Then, as soon as you try to allocate from
> that side of the tree, the finer grained checks in
> __xfs_btree_check_[l|s]block() discover the wrong magic and generate
> the error.
>
> The positive here is that this isn't really a coherency problem with
> the filesystem. The finobt itself is coherent and consistent with the
> inobt, and the magic value doesn't really dictate how the trees are
> used/updated at runtime. This means that the only real corruption
> should be the magic value.
>
> I am wondering whether this warrants a special case in the magic
> checking to convert the runtime internal error into more of a warning
> to unmount and run xfs_repair (similar to how we detected the agfl
> size mismatch problem). Dave, any thoughts on that?
>
> Brian
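Roughly what I'd have in mind for that special case, as an untested
sketch (v5 magics only, for brevity): recognize the sibling tree's
magic specifically and downgrade the verifier failure to an alert that
asks for an unmount and an xfs_repair run:

/*
 * Untested sketch of the "warn, don't fail" idea for finobt buffers
 * whose only defect is an inobt magic left behind by xfs_repair.
 */
static xfs_failaddr_t
xfs_finobt_check_magic(
	struct xfs_buf		*bp)
{
	struct xfs_mount	*mp = bp->b_target->bt_mount;
	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);

	if (block->bb_magic == cpu_to_be32(XFS_IBT_CRC_MAGIC)) {
		xfs_alert(mp,
	"finobt block 0x%llx has inobt magic, possibly from xfs_repair; unmount and run xfs_repair",
			(unsigned long long)bp->b_bn);
		return NULL;		/* warn, but let the buffer through */
	}

	if (block->bb_magic != cpu_to_be32(XFS_FIBT_CRC_MAGIC))
		return __this_address;	/* anything else is real corruption */

	return NULL;
}

Whether silently tolerating the known-benign mismatch is acceptable, or
whether it should also flag the fs for corrective action, is exactly
the open question above.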
> > > So if it's possible to get into this code path while rebuilding
> > > the finobt, xfs_repair will generate an otherwise valid btree node
> > > with the wrong magic, which matches what we see.
> > >
> > > Sadly, we have no more debug information available about the state
> > > of the FS on the first crash, which led us to run xfs_repair in the
> > > first place.
> >
> > Naturally, the next question would have been what required repair in
> > the first place. :P
> >
> > Anyways, if you wanted to give this a quick try to confirm, it looks
> > like a fairly straightforward change. See the appended xfsprogs
> > patch. Note again that you can (and should) run this against the
> > restored metadump that captures the corruption before running against
> > the primary fs.
> >
> > Brian
> >
> > --- 8< ---
> >
> > diff --git a/repair/phase5.c b/repair/phase5.c
> > index 85d1f4fb..05333f11 100644
> > --- a/repair/phase5.c
> > +++ b/repair/phase5.c
> > @@ -983,7 +983,7 @@ init_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
> >
> >  static void
> >  prop_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
> > -	xfs_agino_t startino, int level)
> > +	xfs_btnum_t btnum, xfs_agino_t startino, int level)
> >  {
> >  	struct xfs_btree_block	*bt_hdr;
> >  	xfs_inobt_key_t		*bt_key;
> > @@ -1005,7 +1005,7 @@ prop_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
> >  		 * first path up the left side of the tree
> >  		 * where the agbno's are already set up
> >  		 */
> > -		prop_ino_cursor(mp, agno, btree_curs, startino, level);
> > +		prop_ino_cursor(mp, agno, btree_curs, btnum, startino, level);
> >  	}
> >
> >  	if (be16_to_cpu(bt_hdr->bb_numrecs) ==
> > @@ -1041,7 +1041,7 @@ prop_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
> >  		lptr->buf_p->b_ops = &xfs_inobt_buf_ops;
> >  		bt_hdr = XFS_BUF_TO_BLOCK(lptr->buf_p);
> >  		memset(bt_hdr, 0, mp->m_sb.sb_blocksize);
> > -		libxfs_btree_init_block(mp, lptr->buf_p, XFS_BTNUM_INO,
> > +		libxfs_btree_init_block(mp, lptr->buf_p, btnum,
> >  					level, 0, agno, 0);
> >
> >  		bt_hdr->bb_u.s.bb_leftsib = cpu_to_be32(lptr->prev_agbno);
> > @@ -1049,7 +1049,7 @@ prop_ino_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_status_t *btree_curs,
> >  		/*
> >  		 * propagate extent record for first extent in new block up
> >  		 */
> > -		prop_ino_cursor(mp, agno, btree_curs, startino, level);
> > +		prop_ino_cursor(mp, agno, btree_curs, btnum, startino, level);
> >  	}
> >  	/*
> >  	 * add inode info to current block
> > @@ -1201,7 +1201,7 @@ build_ino_tree(xfs_mount_t *mp, xfs_agnumber_t agno,
> >  			lptr->modulo--;
> >
> >  		if (lptr->num_recs_pb > 0)
> > -			prop_ino_cursor(mp, agno, btree_curs,
> > +			prop_ino_cursor(mp, agno, btree_curs, btnum,
> >  				ino_rec->ino_startnum, 0);
> >
> >  		bt_rec = (xfs_inobt_rec_t *)