Re: [PATCH/RFC 00/11] expose btrfs subvols in mount table correctly

Qu Wenruo <quwenruo.btrfs@xxxxxxx> · Fri, 30 Jul 2021 15:09:12 +0800

On 2021/7/30 2:53 PM, NeilBrown wrote:
On Fri, 30 Jul 2021, Qu Wenruo wrote:

You mean like "du -x"?  Yes.  You would lose the misleading illusion
that there are multiple filesystems.  That is one user expectation that
would need to be addressed before people opt in.

OK, I forgot it's an opt-in feature, so it has less of an impact.

The hope would have to be that everyone would eventually opt in once all
issues were understood.

I'm really not familiar with NFS/VFS, so some of my ideas may sound
super crazy.

Is it possible for nfsd to detect such a "subvolume" concept on its
own, e.g. by checking st_dev and the fsid returned from statfs()?

Then if nfsd finds some boundary which has a different st_dev but the same
fsid as its parent, it knows it's a "subvolume"-like concept.
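
A minimal sketch of the check described above, written as a userspace
illustration rather than anything nfsd actually does.  The function
names are hypothetical, and the premise that both sides of a subvolume
boundary report the same fsid is taken from the message as an
assumption, not verified against any particular filesystem.

```python
import os


def looks_like_subvolume_boundary(parent_dev, parent_fsid, child_dev, child_fsid):
    """Heuristic from the message: st_dev differs, but the fsid matches."""
    return parent_dev != child_dev and parent_fsid == child_fsid


def probe(parent_path, child_path):
    """Apply the heuristic to two real paths (hypothetical helper)."""
    parent_stat, child_stat = os.stat(parent_path), os.stat(child_path)
    # os.statvfs() exposes f_fsid on Python 3.7+; it is used here as a
    # stand-in for the fsid returned by statfs().
    parent_fs, child_fs = os.statvfs(parent_path), os.statvfs(child_path)
    return looks_like_subvolume_boundary(
        parent_stat.st_dev, parent_fs.f_fsid,
        child_stat.st_dev, child_fs.f_fsid,
    )
```

An ordinary mount point would normally differ in both st_dev and fsid,
so under this assumption it would not be mistaken for a subvolume.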

Then do some local inode number mapping inside nfsd?
Like using the highest 20 bits for different subvolumes, and the
remaining 44 bits for real inode numbers.

Of course, this is still a workaround...
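
For illustration only, the bit split suggested above could be sketched
like this.  The names pack_ino and unpack_ino are hypothetical, and
"subvol_index" is assumed to be a small per-export index assigned by
the server, not the raw btrfs subvolume id (which may not fit in 20
bits).

```python
SUBVOL_BITS = 20  # highest bits identify the subvolume; the rest is the inode


def pack_ino(subvol_index, ino, subvol_bits=SUBVOL_BITS):
    """Combine a subvolume index and an inode number into one 64-bit value."""
    inode_bits = 64 - subvol_bits
    if not 0 <= subvol_index < (1 << subvol_bits):
        raise OverflowError("subvolume index does not fit in %d bits" % subvol_bits)
    if not 0 <= ino < (1 << inode_bits):
        raise OverflowError("inode number does not fit in %d bits" % inode_bits)
    return (subvol_index << inode_bits) | ino


def unpack_ino(packed, subvol_bits=SUBVOL_BITS):
    """Split a packed value back into (subvol_index, ino)."""
    inode_bits = 64 - subvol_bits
    return packed >> inode_bits, packed & ((1 << inode_bits) - 1)
```

The overflow checks make the weakness of the scheme visible: any inode
number at or above 2^44 simply cannot be represented.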

Yes, it would certainly be possible to add some hacks to nfsd to fix the
immediate problem, and we could probably even create some well-defined
interfaces into btrfs to extract the required information so that it
wasn't too hackish.

Maybe that is what we will have to do.  But I'd rather not hack NFSD
while there is any chance that a more complete solution will be found.

I'm not quite ready to give up on the idea of squeezing all btrfs inodes
into a 64bit number space.  24bits of subvol and 40 bits of inode?
Make the split a mkfs or mount option?

Btrfs used to have a subvolume number limit in the past, for different
reasons.

In that case, the subvolume number is limited to 48 bits, which is still too
large to avoid conflicts.

For inode numbers, there is really no limit other than the 256 ~ (U64)-256 range.

Considering all these numbers are almost U64, conflicts would be
unavoidable AFAIK.

Maybe hand out inode numbers to subvols in 2^32 chunks so each subvol
(which has ever been accessed) has a mapping from the top 32 bits of the
objectid to the top 32 bits of the inode number.

We don't need something that is theoretically perfect (that's not
possible anyway as we don't have 64bits of device numbers).  We just
need something that is practical and scales adequately.  If you have
petabytes of storage, it is reasonable to spend a gigabyte of memory on
a lookup table(?).
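
A rough sketch of the chunked lookup table described above, under
stated assumptions: the class name ChunkMap and its method are
hypothetical, chunks are handed out in first-use order, and nothing
here addresses persistence across reboots or concurrent access.

```python
CHUNK_BITS = 32
LOW_MASK = (1 << CHUNK_BITS) - 1


class ChunkMap:
    """Hands out 2^32-sized inode-number chunks to subvolumes on first use."""

    def __init__(self):
        # (subvol_id, top 32 bits of objectid) -> top 32 bits of inode number
        self._chunks = {}
        self._next_chunk = 0

    def map_ino(self, subvol_id, objectid):
        key = (subvol_id, objectid >> CHUNK_BITS)
        chunk = self._chunks.get(key)
        if chunk is None:
            if self._next_chunk >= (1 << 32):
                raise OverflowError("out of 2^32-sized inode chunks")
            chunk = self._next_chunk
            self._chunks[key] = chunk
            self._next_chunk += 1
        # The low 32 bits of the objectid are kept unchanged.
        return (chunk << CHUNK_BITS) | (objectid & LOW_MASK)
```

Only subvolumes that have actually been accessed consume a table
entry, which is what keeps the memory cost proportional to use rather
than to the theoretical id space.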

Can such squishing-all-inodes-into-one-namespace work be done in a
more generic way? E.g., let each fs with a "subvolume"-like feature
provide the interface to do that.

Despite that, I still hope to have a way to distinguish the "subvolume"
boundary.

If completely inside btrfs, it's pretty simple to locate a subvolume
boundary.
All subvolume roots have the same inode number, 256.

Maybe we could reserve some special "squished" inode number to indicate
boundary inside a filesystem.

E.g. reserve (u64)-1 as a special indicator for subvolume boundaries.
Most filesystems would have reserved super high inode numbers anyway.

If we can make inode numbers unique, we can possibly leave the st_dev
changing at subvols so that "du -x" works as currently expected.

One thought I had was to use a strong hash to combine the subvol object
id and the inode object id into a 64bit number.  What is the chance of
a collision in practice? :-)

But with just 64bits, conflicts will happen anyway...
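
To put a rough number on the collision question, here is a sketch of
the standard birthday approximation.  It assumes an ideal hash whose
outputs are uniformly distributed over 64 bits, which a real hash only
approximates; the function name is hypothetical.

```python
import math


def collision_probability(n_inodes, bits=64):
    """Approximate chance that at least two of n_inodes hash to the same value.

    Uses p ~= 1 - exp(-n(n-1) / 2^(bits+1)); expm1 keeps precision when
    the probability is very small.
    """
    return -math.expm1(-n_inodes * (n_inodes - 1) / 2.0 ** (bits + 1))
```

Under that assumption, a million inodes gives a chance on the order of
a few in 10^8, while about 2^32 inodes gives roughly a 39% chance of at
least one collision.  So collisions are unlikely for small filesystems
but cannot be ruled out at scale.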

Thanks,
Qu

Thanks,
NeilBrown