On Tue, Jul 03, 2018 at 04:59:01PM -0700, Darrick J. Wong wrote:
> On Tue, Jul 03, 2018 at 03:32:00PM +1000, Dave Chinner wrote:
> > On Sun, Jun 24, 2018 at 12:24:38PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > >
> > > Rebuild the reverse mapping btree from all primary metadata.
> > >
> > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> >
> > ....
> >
> > > +static inline int xfs_repair_rmapbt_setup(
> > > +	struct xfs_scrub_context	*sc,
> > > +	struct xfs_inode		*ip)
> > > +{
> > > +	/* We don't support rmap repair, but we can still do a scan. */
> > > +	return xfs_scrub_setup_ag_btree(sc, ip, false);
> > > +}
> >
> > This comment seems at odds with the commit message....
>
> This is the Kconfig shim needed if CONFIG_XFS_ONLINE_REPAIR=n.

Ok, that wasn't clear from the patch context.

> > > + * This is the most involved of all the AG space btree rebuilds.
> > > + * Everywhere else in XFS we lock inodes and then AG data structures,
> > > + * but generating the list of rmap records requires that we be able
> > > + * to scan both block mapping btrees of every inode in the filesystem
> > > + * to see if it owns any extents in this AG.  We can't tolerate any
> > > + * inode updates while we do this, so we freeze the filesystem to
> > > + * lock everyone else out, and grant ourselves special privileges to
> > > + * run transactions with regular background reclamation turned off.
> >
> > Hmmm. This implies we are going to scan the entire filesystem for
> > every AG we need to rebuild the rmap tree in. That seems like an
> > awful lot of work if there's more than one rmap btree that needs
> > rebuilding.
>
> [some of this Dave and I discussed on IRC, so I'll summarize for
> everyone else here...]

....

> > Given that we've effectively got to shut down access to the
> > filesystem for the entire rmap rebuild while we do an entire
> > filesystem scan, why would we do this online? It's going to be
> > faster to do this rebuild offline (because of all the prefetching,
> > rebuilding all AG trees from the state gathered in the full
> > filesystem passes, etc) and we don't have to hack around potential
> > transaction and memory reclaim deadlock situations, either?
> >
> > So why do rmap rebuilds online at all?
>
> The thing is, xfs_scrub will warm the xfs_buf cache during phases 2
> and 3 while it checks everything.  By the time it gets to rmapbt
> repairs towards the end of phase 4 (if there's enough memory) those
> blocks will still be in cache and online repair doesn't have to wait
> for the disk.

Therein lies the problem: "if there's enough memory". If there's
enough memory to cache all the filesystem metadata, track all the
bits repair needs to track, and there's no other memory pressure,
then it will hit the cache.

But populating that cache is still going to be slower than an offline
repair because of IO patterns (see below), and there is competing IO
from other work being done on the system (i.e. online repair competes
for IO resources and memory resources).

As such, I don't see that we're going to have everything we need
cached for any significantly sized or busy filesystem, and that means
we actually have to care about how much IO online repair algorithms
require. We also have to take into account that much of that IO is
going to be synchronous single metadata block reads. This will be a
limitation on any sort of high IO latency storage (spinning rust,
network based block devices, slow SSDs, etc).
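
To put some very rough numbers on why that matters, here's a trivial
back-of-envelope model. Every figure in it is invented purely for
illustration (metadata block counts, random read latency and scan
bandwidth will vary wildly between systems), so treat it as a sketch
of the argument, not a measurement:

#include <stdio.h>

int main(void)
{
	double nr_blocks = 10e6;	/* assumed: 10 million metadata blocks */
	double block_size = 4096;	/* assumed: 4k metadata block size */
	double read_latency = 0.008;	/* assumed: 8ms per synchronous random read */
	double scan_bw = 200e6;		/* assumed: 200MB/s sequential scan rate */

	/* seek bound: one synchronous read per metadata block */
	double seek_bound = nr_blocks * read_latency;
	/* bandwidth bound: stream the same metadata linearly */
	double bw_bound = nr_blocks * block_size / scan_bw;

	printf("one-block-at-a-time walk: %.0f s (~%.1f hours)\n",
			seek_bound, seek_bound / 3600);
	printf("linear metadata scan:     %.0f s (~%.1f minutes)\n",
			bw_bound, bw_bound / 60);
	return 0;
}

Even if those guesses are off by an order of magnitude in online
repair's favour, a seek-bound single-block walk is still measured in
hours while a bandwidth-bound linear scan of the same metadata is
measured in minutes.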
> If instead you unmount and run xfs_repair then xfs_repair has to
> reload all that metadata and recheck it, all of which happens with
> the fs offline.

xfs_repair has all sorts of concurrency and prefetching optimisations
that allow it to scan and process metadata orders of magnitude faster
than online repair, especially on slow storage. i.e. online repair is
going to be IO seek bound, while offline repair is typically IO
bandwidth and/or CPU bound. Offline repair can do full filesystem
metadata scans measured in GB/s; as long as online repair does
serialised synchronous single structure walks it will be orders of
magnitude slower than an offline repair.

> So except for the extra complexity of avoiding deadlocks (which I
> readily admit is not a small task) I at least don't think it's a
> clear-cut downtime win to rely on xfs_repair.

Back then - as it is still now - I couldn't see how the IO load
required by synchronous full filesystem scans one structure at a time
was going to reduce filesystem downtime compared to an offline repair
doing optimised "all metadata types at once" concurrent linear AG
scans.

Keep in mind that online repair will never guarantee that it can fix
all problems, so we're always going to need offline repair. What we
want to achieve is minimising downtime for users when a repair is
required. With the above IO limitations in mind, I've always
considered that online repair would just be for all the simple,
quick, easy to fix stuff, because complex stuff that required huge
amounts of RAM and full filesystem scans to resolve would always be
done faster offline.

That's why I think that offline repair will be a better choice for
users for the foreseeable future if repairing the damage requires
full filesystem metadata scans.

> > > +
> > > +	rre = kmem_alloc(sizeof(struct xfs_repair_rmapbt_extent), KM_MAYFAIL);
> > > +	if (!rre)
> > > +		return -ENOMEM;
> >
> > This seems like a likely thing to happen given the "no reclaim"
> > state of the filesystem and the memory demand a rmapbt rebuild
> > can have. If we've got GBs of rmap info in the AG that needs to be
> > rebuilt, how much RAM are we going to need to index it all as we
> > scan the filesystem?
>
> More than I'd like -- at least 24 bytes per record (which at least is
> no larger than the size of the on-disk btree) plus a list_head until
> I can move the repairers away from creating huge lists.

Ok, it kinda sounds a bit like we need to be able to create the new
btree on the fly, rather than as a single operation at the end. e.g.
if the list builds up to, say, 100k records, we push them into the
new tree and then free them. i.e. can we iteratively build the new
tree on disk as we go, then do a root block swap at the end to switch
from the old tree to the new tree? A rough userspace sketch of the
sort of batched flushing I'm thinking of is below my sig.

If that's feasible, then maybe we can look at it as a future
direction? It also opens up the possibility of pausing/continuing a
repair from where the last chunk of records was processed, so if we
do run out of memory we don't have to start from the beginning again?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
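
[Purely illustrative sketch of the batched flushing idea above. None
of this is real XFS code; the structure, names and the 100k threshold
are all made up. The point is just the shape: accumulate in-memory
rmap records, push a full batch into the new btree being built and
free the memory so usage stays bounded, then finish with a root block
swap.]

#include <stdio.h>
#include <stdlib.h>

struct fake_rmap_rec {			/* stand-in for the ~24-byte record */
	unsigned long long	startblock;
	unsigned long long	blockcount;
	unsigned long long	owner;
};

#define BATCH_LIMIT	100000		/* e.g. flush every 100k records */

static struct fake_rmap_rec	*batch;
static size_t			nr_batched;

/* Pretend to bulk-insert the accumulated records into the new btree. */
static void flush_batch(void)
{
	printf("flushing %zu records into the new btree\n", nr_batched);
	nr_batched = 0;		/* records are now on disk; reuse the memory */
}

static void add_record(unsigned long long bno, unsigned long long len,
		       unsigned long long owner)
{
	batch[nr_batched].startblock = bno;
	batch[nr_batched].blockcount = len;
	batch[nr_batched].owner = owner;
	if (++nr_batched == BATCH_LIMIT)
		flush_batch();
}

int main(void)
{
	batch = calloc(BATCH_LIMIT, sizeof(*batch));
	if (!batch)
		return 1;

	/* stand-in for the inode/bmbt scan that generates rmap records */
	for (unsigned long long i = 0; i < 250000; i++)
		add_record(i * 8, 8, 128 + i);

	if (nr_batched)
		flush_batch();		/* final partial batch */
	/* ...then a root block swap would switch from old tree to new */
	free(batch);
	return 0;
}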