On Sun, Jun 24, 2018 at 12:24:38PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
>
> Rebuild the reverse mapping btree from all primary metadata.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>

....

> +static inline int xfs_repair_rmapbt_setup(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_inode		*ip)
> +{
> +	/* We don't support rmap repair, but we can still do a scan. */
> +	return xfs_scrub_setup_ag_btree(sc, ip, false);
> +}

This comment seems at odds with the commit message....

....

> +/*
> + * Reverse Mapping Btree Repair
> + * ============================
> + *
> + * This is the most involved of all the AG space btree rebuilds. Everywhere
> + * else in XFS we lock inodes and then AG data structures, but generating the
> + * list of rmap records requires that we be able to scan both block mapping
> + * btrees of every inode in the filesystem to see if it owns any extents in
> + * this AG. We can't tolerate any inode updates while we do this, so we
> + * freeze the filesystem to lock everyone else out, and grant ourselves
> + * special privileges to run transactions with regular background reclamation
> + * turned off.

Hmmm. This implies we are going to scan the entire filesystem for
every AG we need to rebuild the rmap tree in. That seems like an
awful lot of work if there's more than one rmap btree that needs
rebuilding.

> + * We also have to be very careful not to allow inode reclaim to start a
> + * transaction because all transactions (other than our own) will block.

What happens when we run out of memory? Inode reclaim will need to
run at that point, right?

> + * So basically we scan all primary per-AG metadata and all block maps of all
> + * inodes to generate a huge list of reverse map records. Next we look for
> + * gaps in the rmap records to calculate all the unclaimed free space (1).
> + * Next, we scan all other OWN_AG metadata (bnobt, cntbt, agfl) and subtract
> + * the space used by those btrees from (1), and also subtract the free space
> + * listed in the bnobt from (1). What's left are the gaps in assigned space
> + * that the new rmapbt knows about but the existing bnobt doesn't; these are
> + * the blocks from the old rmapbt and they can be freed.

This looks like a lot of repeated work. We've already scanned a
bunch of these trees to repair them, then thrown away the scan
results. Now we do another scan of what we've rebuilt.....

... hold on.

Chicken and egg.

We verify and rebuild all the other trees from the rmap information
- how do we determine that the rmapbt needs to be rebuilt and that
the metadata it's being rebuilt from is valid?

Given that we've effectively got to shut down access to the
filesystem for the entire rmap rebuild while we do an entire
filesystem scan, why would we do this online? It's going to be
faster to do this rebuild offline (because of all the prefetching,
rebuilding all AG trees from the state gathered in the full
filesystem passes, etc) and we don't have to hack around potential
transaction and memory reclaim deadlock situations, either?

So why do rmap rebuilds online at all?
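To put a rough number on what a full filesystem scan per damaged AG
means (figures invented purely for illustration): a filesystem with a
few hundred AGs and ~3x10^8 allocated inodes that has, say, 10 rmapbts
fail scrub will do 10 separate freeze/scan cycles:

	10 rebuilds * 3x10^8 inodes = 3x10^9 iget/ilock/bmap-scan/iput
	cycles

all of it performed while every other user of the filesystem is
frozen out.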
> + */
> +
> +/* Set us up to repair reverse mapping btrees. */
> +int
> +xfs_repair_rmapbt_setup(
> +	struct xfs_scrub_context	*sc,
> +	struct xfs_inode		*ip)
> +{
> +	int				error;
> +
> +	/*
> +	 * Freeze out anything that can lock an inode. We reconstruct
> +	 * the rmapbt by reading inode bmaps with the AGF held, which is
> +	 * only safe w.r.t. ABBA deadlocks if we're the only ones locking
> +	 * inodes.
> +	 */
> +	error = xfs_scrub_fs_freeze(sc);
> +	if (error)
> +		return error;
> +
> +	/* Check the AG number and set up the scrub context. */
> +	error = xfs_scrub_setup_fs(sc, ip);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Lock all the AG header buffers so that we can read all the
> +	 * per-AG metadata too.
> +	 */
> +	error = xfs_repair_grab_all_ag_headers(sc);
> +	if (error)
> +		return error;

So if we have thousands of AGs (think PB scale filesystems) then
we're going to hold many thousands of locked buffers here? Just so
we can rebuild the rmapbt in one AG?

What does holding these buffers locked protect us against that an
active freeze doesn't?

> +xfs_repair_rmapbt_new_rmap(
> +	struct xfs_repair_rmapbt	*rr,
> +	xfs_agblock_t			startblock,
> +	xfs_extlen_t			blockcount,
> +	uint64_t			owner,
> +	uint64_t			offset,
> +	unsigned int			flags)
> +{
> +	struct xfs_repair_rmapbt_extent	*rre;
> +	int				error = 0;
> +
> +	trace_xfs_repair_rmap_extent_fn(rr->sc->mp, rr->sc->sa.agno,
> +			startblock, blockcount, owner, offset, flags);
> +
> +	if (xfs_scrub_should_terminate(rr->sc, &error))
> +		return error;
> +
> +	rre = kmem_alloc(sizeof(struct xfs_repair_rmapbt_extent), KM_MAYFAIL);
> +	if (!rre)
> +		return -ENOMEM;

This seems like a likely thing to happen given the "no reclaim"
state of the filesystem and the memory demand a rmapbt rebuild can
have. If we've got GBs of rmap info in the AG that needs to be
rebuilt, how much RAM are we going to need to index it all as we
scan the filesystem?
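Back of the envelope, just to show the scale I'm worried about: the
on-disk rmapbt record is 24 bytes, so "GBs of rmap info" is on the
order of 10^8 records for that one AG. If I'm reading the structures
right, each of those turns into a struct xfs_repair_rmapbt_extent on
a list - roughly a 32 byte xfs_rmap_irec plus a 16 byte list_head,
which kmem_alloc will round up to a 64 byte slab object:

	10^8 records * 64 bytes ~= 6.4GB

of kernel memory that can't be reclaimed for as long as the scan
runs, on top of all the cached inodes we're refusing to reclaim.
(Sizes are my estimate from reading the structures, not measured.)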
> +xfs_repair_rmapbt_scan_ifork(
> +	struct xfs_repair_rmapbt	*rr,
> +	struct xfs_inode		*ip,
> +	int				whichfork)
> +{
> +	struct xfs_bmbt_irec		rec;
> +	struct xfs_iext_cursor		icur;
> +	struct xfs_mount		*mp = rr->sc->mp;
> +	struct xfs_btree_cur		*cur = NULL;
> +	struct xfs_ifork		*ifp;
> +	unsigned int			rflags;
> +	int				fmt;
> +	int				error = 0;
> +
> +	/* Do we even have data mapping extents? */
> +	fmt = XFS_IFORK_FORMAT(ip, whichfork);
> +	ifp = XFS_IFORK_PTR(ip, whichfork);
> +	switch (fmt) {
> +	case XFS_DINODE_FMT_BTREE:
> +		if (!(ifp->if_flags & XFS_IFEXTENTS)) {
> +			error = xfs_iread_extents(rr->sc->tp, ip, whichfork);
> +			if (error)
> +				return error;
> +		}

Ok, so we need inodes locked to do this....

....

> +/* Iterate all the inodes in an AG group. */
> +STATIC int
> +xfs_repair_rmapbt_scan_inobt(
> +	struct xfs_btree_cur		*cur,
> +	union xfs_btree_rec		*rec,
> +	void				*priv)
> +{
> +	struct xfs_inobt_rec_incore	irec;
> +	struct xfs_repair_rmapbt	*rr = priv;
> +	struct xfs_mount		*mp = cur->bc_mp;
> +	struct xfs_inode		*ip = NULL;
> +	xfs_ino_t			ino;
> +	xfs_agino_t			agino;
> +	int				chunkidx;
> +	int				lock_mode = 0;
> +	int				error = 0;
> +
> +	xfs_inobt_btrec_to_irec(mp, rec, &irec);
> +
> +	for (chunkidx = 0, agino = irec.ir_startino;
> +	     chunkidx < XFS_INODES_PER_CHUNK;
> +	     chunkidx++, agino++) {
> +		bool	inuse;
> +
> +		/* Skip if this inode is free */
> +		if (XFS_INOBT_MASK(chunkidx) & irec.ir_free)
> +			continue;
> +		ino = XFS_AGINO_TO_INO(mp, cur->bc_private.a.agno, agino);
> +
> +		/* Back off and try again if an inode is being reclaimed */
> +		error = xfs_icache_inode_is_allocated(mp, cur->bc_tp, ino,
> +				&inuse);
> +		if (error == -EAGAIN)
> +			return -EDEADLOCK;

And we can get inode access errors here.....

FWIW, how is the inode being reclaimed if the filesystem is frozen?

> +
> +		/*
> +		 * Grab inode for scanning. We cannot use DONTCACHE here
> +		 * because we already have a transaction so the iput must not
> +		 * trigger inode reclaim (which might allocate a transaction
> +		 * to clean up posteof blocks).
> +		 */
> +		error = xfs_iget(mp, cur->bc_tp, ino, 0, 0, &ip);

So if there are enough inodes in the AG, we'll run out of memory
here because we aren't reclaiming inodes from the cache but instead
putting them all on the deferred iput list?

> +		if (error)
> +			return error;
> +		trace_xfs_scrub_iget(ip, __this_address);
> +
> +		if ((ip->i_d.di_format == XFS_DINODE_FMT_BTREE &&
> +		     !(ip->i_df.if_flags & XFS_IFEXTENTS)) ||
> +		    (ip->i_d.di_aformat == XFS_DINODE_FMT_BTREE &&
> +		     !(ip->i_afp->if_flags & XFS_IFEXTENTS)))
> +			lock_mode = XFS_ILOCK_EXCL;
> +		else
> +			lock_mode = XFS_ILOCK_SHARED;
> +		if (!xfs_ilock_nowait(ip, lock_mode)) {
> +			error = -EBUSY;
> +			goto out_rele;
> +		}

And in what situation do we get inodes stuck with the ilock held on
frozen filesystems?

....

> +out_unlock:
> +	xfs_iunlock(ip, lock_mode);
> +out_rele:
> +	iput(VFS_I(ip));
> +	return error;

Calling iput in the error path is a bug - it will trigger all the
paths you're trying to avoid by using the deferred iput list.
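What I'd expect here is something like the following - completely
untested, and xfs_scrub_iput_defer() is just a name I've made up to
stand in for whatever the deferred iput mechanism in this series is
actually called:

out_unlock:
	xfs_iunlock(ip, lock_mode);
out_rele:
	/*
	 * Defer the final iput so that it can't recurse into inode
	 * reclaim or start a transaction while we hold the freeze.
	 */
	xfs_scrub_iput_defer(rr->sc, ip);
	return error;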
....

> +/* Collect rmaps for all block mappings for every inode in this AG. */
> +STATIC int
> +xfs_repair_rmapbt_generate_aginode_rmaps(
> +	struct xfs_repair_rmapbt	*rr,
> +	xfs_agnumber_t			agno)
> +{
> +	struct xfs_scrub_context	*sc = rr->sc;
> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_btree_cur		*cur;
> +	struct xfs_buf			*agi_bp;
> +	int				error;
> +
> +	error = xfs_ialloc_read_agi(mp, sc->tp, agno, &agi_bp);
> +	if (error)
> +		return error;
> +	cur = xfs_inobt_init_cursor(mp, sc->tp, agi_bp, agno, XFS_BTNUM_INO);
> +	error = xfs_btree_query_all(cur, xfs_repair_rmapbt_scan_inobt, rr);

So if we get a locked or reclaiming inode anywhere in the filesystem
we see EDEADLOCK/EBUSY here without having scanned all the inodes in
the AG, right?

> +	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
> +	xfs_trans_brelse(sc->tp, agi_bp);
> +	return error;
> +}
> +
> +/*
> + * Generate all the reverse-mappings for this AG, a list of the old rmapbt
> + * blocks, and the new btreeblks count. Figure out if we have enough free
> + * space to reconstruct the inode btrees. The caller must clean up the lists
> + * if anything goes wrong.
> + */
> +STATIC int
> +xfs_repair_rmapbt_find_rmaps(
> +	struct xfs_scrub_context	*sc,
> +	struct list_head		*rmap_records,
> +	xfs_agblock_t			*new_btreeblks)
> +{
> +	struct xfs_repair_rmapbt	rr;
> +	xfs_agnumber_t			agno;
> +	int				error;
> +
> +	rr.rmaplist = rmap_records;
> +	rr.sc = sc;
> +	rr.nr_records = 0;
> +
> +	/* Generate rmaps for AG space metadata */
> +	error = xfs_repair_rmapbt_generate_agheader_rmaps(&rr);
> +	if (error)
> +		return error;
> +	error = xfs_repair_rmapbt_generate_log_rmaps(&rr);
> +	if (error)
> +		return error;
> +	error = xfs_repair_rmapbt_generate_freesp_rmaps(&rr, new_btreeblks);
> +	if (error)
> +		return error;
> +	error = xfs_repair_rmapbt_generate_inobt_rmaps(&rr);
> +	if (error)
> +		return error;
> +	error = xfs_repair_rmapbt_generate_refcountbt_rmaps(&rr);
> +	if (error)
> +		return error;
> +
> +	/* Iterate all AGs for inodes rmaps. */
> +	for (agno = 0; agno < sc->mp->m_sb.sb_agcount; agno++) {
> +		error = xfs_repair_rmapbt_generate_aginode_rmaps(&rr, agno);
> +		if (error)
> +			return error;

And that means we abort here....

> +/* Repair the rmap btree for some AG. */
> +int
> +xfs_repair_rmapbt(
> +	struct xfs_scrub_context	*sc)
> +{
> +	struct xfs_owner_info		oinfo;
> +	struct list_head		rmap_records;
> +	xfs_extlen_t			new_btreeblks;
> +	int				log_flags = 0;
> +	int				error;
> +
> +	xfs_scrub_perag_get(sc->mp, &sc->sa);
> +
> +	/* Collect rmaps for all AG headers. */
> +	INIT_LIST_HEAD(&rmap_records);
> +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_UNKNOWN);
> +	error = xfs_repair_rmapbt_find_rmaps(sc, &rmap_records, &new_btreeblks);
> +	if (error)
> +		goto out;

And we drop out here.

So, essentially, any ENOMEM, locked inode or inode in reclaim
anywhere in the filesystem will prevent rmap rebuild. Which says to
me that rebuilding the rmap on any substantial filesystem is likely
to fail.

Which brings me back to my original question: why attempt to do
rmap rebuild online given how complex it is, the performance
implications of a full filesystem scan per AG that needs rebuild,
and all the ways it could easily fail?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx