On Thu, Sep 05, 2024 at 08:34:39AM GMT, Dave Chinner wrote:
> I've seen xfs_repair require a couple of TB of RAM to repair
> metadata heavy filesystems of relatively small size (sub-20TB).
> Once you get about a few hundred GB of metadata in the filesystem,
> the fsck cross-reference data set size can easily run into the TBs.
>
> So 256GB might *seem* like a lot of memory, but we were seeing
> xfs_repair exceed that amount of RAM for metadata heavy filesystems
> at least a decade ago...
>
> Indeed, we recently heard about a 6TB filesystem with 15 *billion*
> hardlinks in it. The cross reference for resolving all those
> hardlinks would require somewhere in the order of 1.5TB of RAM to
> hold. The only way to reliably handle random access data sets this
> large is with pageable memory....

Christ... This is also where space efficiency of metadata starts to
really matter. Of course you store full backreferences for every
hardlink, which is nice in some ways and a pain in others.

> > Another more pressing one is the extents -> backpointers and
> > backpointers -> extents passes of fsck; we do a linear scan through one
> > btree checking references to another btree. For the btree we're checking
> > references to, the lookups are random, so we need to cache and pin the
> > entire btree in ram if possible, or if not whatever will fit, and we run
> > in multiple passes.
> >
> > This is the #1 scalability issue hitting a number of users right now, so
> > I may need to rewrite it to pull backpointers into an eytzinger array
> > and do our random lookups for backpointers on that - but that will be
> > "the biggest vmalloc array we can possibly allocate", so the INT_MAX
> > size limit is clearly an issue there...
>
> Given my above comments, I think you are approaching this problem
> the wrong way. It is known that the data set can exceed physical
> kernel memory size, hence it needs to be swappable. That way users
> can extend the kernel memory capacity via swapfiles when
> bcachefs.fsck needs more memory than the system has physical RAM.

Well, it depends on the locality of the cross references - I don't
think we want to go that route here, because if there isn't any
locality in the cross references we'll just be thrashing; better to
run in multiple passes, constraining each pass to what _will_ fit in
ram...

It would be nice if we had a way to guesstimate locality in extents
<-> backpointers references - if there is locality, then it's better
to just run in one pass - and we wouldn't bother with building up new
tables, we'd just rely on the btree node cache. Perhaps that's what
we'll do when online fsck is finished and we're optimizing more for
"don't disturb the rest of the system too much" than "get it done as
quick as possible".

I do need to start making use of Darrick's swappable memory code in
at least one other place though - the bucket tables when we're
checking basic allocation info. That one just exceeded the INT_MAX
limit for a user with a 30 TB hard drive, so I switched it to a radix
tree for now, but it really should be swappable memory. Fortunately
there's more locality in the accesses there.
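To put rough numbers on the INT_MAX thing - bucket size and per-bucket
state size below are made up for illustration, they're not the real
bcachefs numbers - the back of the envelope looks something like:

/* Illustration only: bucket size and per-bucket fsck state size are
 * made-up numbers, not what bcachefs actually uses. */
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t dev_size    = 30ULL << 40;     /* 30 TiB device */
        uint64_t bucket_size = 256 << 10;       /* assume 256 KiB buckets */
        uint64_t nr_buckets  = dev_size / bucket_size;
        uint64_t state_size  = 32;              /* assume 32 bytes of state per bucket */
        uint64_t table_bytes = nr_buckets * state_size;

        printf("%llu buckets -> %llu byte table (INT_MAX is %d)\n",
               (unsigned long long) nr_buckets,
               (unsigned long long) table_bytes, INT_MAX);

        /* A single flat array would need one multi-GiB allocation, so the
         * INT_MAX cap on a single allocation bites long before the machine
         * is actually short on RAM. */
        if (table_bytes > (uint64_t) INT_MAX)
                printf("too big for one flat allocation\n");
        return 0;
}

Which is why it's currently chunked up behind a radix tree, and why it
really wants to be swappable memory instead.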
> Hence Darrick designed and implemented pageable shmem backed memory
> files (xfiles) to hold these data sets. Hence the size limit of the
> online repair data set is physical RAM + swap space, same as it is
> for offline repair. You can find the xfile code in
> fs/xfs/scrub/xfile.[ch].
>
> Support for large, sortable arrays of fixed size records built on
> xfiles can be found in xfarray.[ch], and blob storage in
> xfblob.[ch].

*nod*

I do wish we had normal virtually mapped swappable memory though -
the thing I don't like about xfarray is that it requires a radix tree
walk on every access, and we have _hardware_ that's meant to do that
for us. But if you still care about 32 bit then that does necessitate
Darrick's approach. I'm willing to consider 32 bit legacy for
bcachefs, though.
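To spell out what I mean by "virtually mapped swappable memory", here's
a userspace analogy - illustration only, nothing bcachefs or XFS
specific: an anonymous mmap() is demand paged and swappable, but
indexing into it is still a plain load/store, with the MMU doing the
translation:

/* Userspace analogy only - not kernel code. An anonymous mapping is
 * swappable and demand paged, yet indexing it is just a load/store;
 * the page table walk happens in hardware, not in software on every
 * access. */
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        size_t nr    = 1ULL << 28;              /* 256M entries */
        size_t bytes = nr * sizeof(uint64_t);   /* 2 GiB, > INT_MAX */

        uint64_t *table = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (table == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Sparse, random-ish accesses: paging is the kernel's problem,
         * address translation is the hardware's. */
        for (size_t i = 0; i < nr; i += 1 << 16)
                table[i] = i;

        printf("table[%d] = %llu\n", 1 << 16,
               (unsigned long long) table[1 << 16]);

        munmap(table, bytes);
        return 0;
}

An xfarray access, by contrast, has to go find the backing shmem page
in software every time - that's the per-access overhead I'm grumbling
about.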