Re: [PATCH/RFC 00/11] expose btrfs subvols in mount table correctly

Josef Bacik <josef@xxxxxxxxxxxxxx> · Wed, 28 Jul 2021 17:30:04 -0400

On 7/28/21 3:35 PM, J. Bruce Fields wrote:
> I'm still stuck trying to understand why subvolumes can't get their own
> superblocks:
>
> 	- Why are the performance issues Josef raises unsurmountable?
> 	  And why are they unique to btrfs?  (Surely there are other cases
> 	  where people need hundreds or thousands of superblocks?)

I don't think anybody has that many file systems.  For btrfs it's a single file
system.  Think of syncfs: it's going to walk through all of the super blocks on
the system calling ->sync_fs on each subvol superblock.  Now this isn't a huge
deal, we could just have some flag that says "I'm not real" or even just have
anonymous superblocks that don't get added to the global super_blocks list, and
that would address my main pain points.

The second part is inode reclaim.  Again this particular problem could be
avoided if we had an anonymous superblock that wasn't actually used, but the
inode LRU is per superblock.  Now with reclaim, instead of walking all the
inodes, you're walking a bunch of super blocks and then walking the list of
inodes within those super blocks.  You're burning CPU cycles because now, instead
of getting big chunks of inodes to dispose of, the work is spread out across
many super blocks.
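
To make the reclaim point concrete, here is a rough userspace toy model (not
kernel code).  It assumes reclaim just drains per-superblock LRU lists in order
until it has freed a batch; the names reclaim() and make_lrus() and all the
numbers are made up for illustration.

```python
from collections import deque


def make_lrus(total_inodes, nr_supers):
    """Spread total_inodes evenly across nr_supers per-superblock LRU lists."""
    per_sb = total_inodes // nr_supers
    return [deque(range(per_sb)) for _ in range(nr_supers)]


def reclaim(lrus, want):
    """Free up to `want` inodes, visiting each LRU list in turn.

    Returns (inodes_freed, lists_visited).
    """
    freed = 0
    visited = 0
    for lru in lrus:
        if freed >= want:
            break
        visited += 1
        # Take as much as this list can give, up to what is still wanted.
        while lru and freed < want:
            lru.popleft()
            freed += 1
    return freed, visited


# Same inode count and same batch size, one superblock vs. 1000 of them.
single = reclaim(make_lrus(100_000, 1), want=1024)
many = reclaim(make_lrus(100_000, 1000), want=1024)
print(single, many)
```

With one superblock the batch comes off a single list; with 1000 superblocks
the same batch takes 11 list visits in this model, each yielding a small chunk.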

The other weird thing is the way we apply pressure to shrinker systems.  We
essentially say "try to evict X objects from your list", which means in this
case with lots of subvolumes we'd be evicting waaaaay more inodes than we were
before, likely impacting performance where you have workloads that have lots of
files open across many subvolumes (which is what FB does with its containers).
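
A back-of-the-envelope sketch of that over-eviction, again as a toy model
rather than kernel code.  It assumes every per-superblock shrinker gets the
same "evict X" request, which is a simplification of how the real scan count is
computed; apply_pressure() and the numbers are hypothetical.

```python
def apply_pressure(lru_sizes, nr_to_scan):
    """Ask every per-superblock shrinker to evict up to nr_to_scan objects.

    lru_sizes holds the number of cached inodes on each superblock's list.
    Returns the total number of inodes evicted across all lists.
    """
    return sum(min(size, nr_to_scan) for size in lru_sizes)


# 100,000 cached inodes either on one list or split across 100 subvolumes.
one_sb = apply_pressure([100_000], nr_to_scan=128)
many_sb = apply_pressure([1_000] * 100, nr_to_scan=128)
print(one_sb, many_sb)
```

Under those assumptions the same round of pressure evicts 128 inodes with one
superblock and 12,800 with 100 of them.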

If we want an anonymous superblock per subvolume then the only way it'll work is
if it's not actually tied into anything, and we still use the primary super
block for the whole file system.  And if that's what we're going to do, what's
the point of the super block exactly?  This approach that Neil's come up with 
seems like a reasonable solution to me.  Christoph gets his separation and 
/proc/self/mountinfo, and we avoid the scalability headache of a billion super 
blocks.  Thanks,

Josef