Re: A Third perspective on BTRFS nfsd subvol dev/inode number issues.

Qu Wenruo <quwenruo.btrfs@xxxxxxx> · Wed, 4 Aug 2021 06:29:43 +0800

On 2021/8/2 9:53 PM, Josef Bacik wrote:
On 8/2/21 3:54 AM, Amir Goldstein wrote:
On Mon, Aug 2, 2021 at 8:41 AM NeilBrown <neilb@xxxxxxx> wrote:

On Mon, 02 Aug 2021, Al Viro wrote:
On Mon, Aug 02, 2021 at 02:18:29PM +1000, NeilBrown wrote:

I think we need to bite the bullet and decide that 64 bits is not
enough, and in fact no number of bits will ever be enough.  overlayfs
makes this clear.

Sure - let's go for broke and use XML.  Oh, wait - it's 8 months too
early...

So I think we need to strongly encourage user-space to start using
name_to_handle_at() whenever there is a need to test if two things are
the same.
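
For illustration, a minimal sketch of such a comparison from user space.
It is written in Python via ctypes, since the Python standard library has
no wrapper for name_to_handle_at().  The helper name file_handle() is
made up for this sketch, and it assumes Linux with a libc that exports
the call.  Handles are only meaningful within one filesystem, and whether
unequal handles always mean different objects is exactly what is argued
about in this thread.

```python
import ctypes
import errno
import os

AT_FDCWD = -100        # "relative to the current directory", as in <fcntl.h>
MAX_HANDLE_SZ = 128    # upper bound on handle size in the kernel headers


class FileHandle(ctypes.Structure):
    # Mirrors struct file_handle, with a fixed-size buffer for f_handle.
    _fields_ = [
        ("handle_bytes", ctypes.c_uint),
        ("handle_type", ctypes.c_int),
        ("f_handle", ctypes.c_ubyte * MAX_HANDLE_SZ),
    ]


def file_handle(path):
    """Return (handle_type, handle_bytes) for path, or raise OSError."""
    libc = ctypes.CDLL(None, use_errno=True)
    func = getattr(libc, "name_to_handle_at", None)
    if func is None:
        raise OSError(errno.ENOSYS, "name_to_handle_at() not available")
    func.argtypes = [
        ctypes.c_int,
        ctypes.c_char_p,
        ctypes.POINTER(FileHandle),
        ctypes.POINTER(ctypes.c_int),
        ctypes.c_int,
    ]
    func.restype = ctypes.c_int

    fh = FileHandle(handle_bytes=MAX_HANDLE_SZ)
    mount_id = ctypes.c_int()
    rc = func(AT_FDCWD, os.fsencode(path),
              ctypes.byref(fh), ctypes.byref(mount_id), 0)
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)
    # On success the kernel has set handle_bytes to the real handle length.
    return fh.handle_type, bytes(fh.f_handle[: fh.handle_bytes])
```

Two paths on the same filesystem would then be compared with
file_handle(a) == file_handle(b).  Some filesystems do not support the
call at all, so a caller has to be ready for OSError.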

... and forgetting the inconvenient facts, such as that two different
fhandles may correspond to the same object.

Can they?  They certainly can if the "connectable" flag is passed.
name_to_handle_at() cannot set that flag.
nfsd can, so using name_to_handle_at() on an NFS filesystem isn't quite
perfect.  However it is the best that can be done over NFS.

Or is there some other situation where two different filehandles can be
reported for the same inode?

Do you have a better suggestion?

Neil,

I think the plan of "changing the world" is not very realistic.
Sure, *some* tools can be changed, but all of them?

I went back to read your initial cover letter to understand the
problem and what I mostly found there was that the view of
/proc/x/mountinfo was hiding information that is important for
some tools to understand what is going on with btrfs subvols.
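
As a point of reference, a rough sketch of what tools do with that file
today: split each line into the fields documented in proc(5).  The names
MountEntry and parse_mountinfo_line are invented for this sketch, and
octal escapes such as \040 in paths are left undecoded.

```python
from typing import NamedTuple


class MountEntry(NamedTuple):
    mount_id: int
    parent_id: int
    st_dev: str        # "major:minor" of the mounted filesystem
    root: str          # path inside the filesystem that forms the mount root
    mount_point: str
    fs_type: str
    source: str


def parse_mountinfo_line(line):
    """Parse one line of /proc/PID/mountinfo (field layout per proc(5))."""
    # Everything before " - " is per-mount data followed by zero or more
    # optional fields; everything after it describes the superblock.
    left, _, right = line.rstrip("\n").partition(" - ")
    fields = left.split(" ")
    fs_type, source = right.split(" ")[:2]
    return MountEntry(int(fields[0]), int(fields[1]), fields[2],
                      fields[3], fields[4], fs_type, source)
```

Feeding it the lines of /proc/self/mountinfo gives one MountEntry per
mount.  For a btrfs subvolume mount, the root field carries the
subvolume path.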

Well I am not a UNIX history expert, but I suppose that
/proc/PID/mountinfo was created because /proc/mounts and
/proc/PID/mounts no longer provided tools with all the information
about Linux mounts.

Maybe it's time for a new interface to query the more advanced
sb/mount topology? fsinfo() maybe? With mount2 compatible API for
traversing mounts that is not limited to reporting all entries inside
a single page. I suppose we could go for some hierarchical view
under /proc/PID/mounttree. I don't know - new API is hard.

In any case, instead of changing st_dev and st_ino or changing the
world to work with file handles, why not add inode generation (and
maybe subvol id) to statx()?
Filesystems that care enough will provide this information, and tools
that care enough will use it.

Can y'all wait till I'm back from vacation, goddamn ;)

This is what I'm aiming for, I spent some time looking at how many
places we string parse /proc/<whatever>/mounts and my head hurts.

Btrfs already has a reasonable solution for this: we have UUIDs for
everything.  UUIDs aren't strictly a btrfs thing either; all the file
systems have some sort of UUID identifier, hell, it's built into blkid.  I
would love it if we could do a better job of letting applications query
information about where they are.  And we could expose this with the
relatively common UUID format.  You ask what fs you're in, you get the
FS UUID, and then if you're on Btrfs you get the specific subvolume UUID
you're in.  That way you could do more fancy things like know if you've
wandered into a new file system completely or just a different subvolume.
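
A minimal sketch of the check an application could make with such an
interface.  No such query call exists in this thread; the Location type
and the crossing() helper are hypothetical, and the sketch only assumes
that the answer comes back as an FS UUID plus an optional subvolume UUID.

```python
from dataclasses import dataclass
from typing import Optional
from uuid import UUID


@dataclass(frozen=True)
class Location:
    """Hypothetical answer to "where am I?" for one path."""
    fs_uuid: UUID
    subvol_uuid: Optional[UUID] = None   # None when the fs has no subvolumes


def crossing(a, b):
    """Classify the step from location a to location b."""
    if a.fs_uuid != b.fs_uuid:
        return "new filesystem"
    if a.subvol_uuid != b.subvol_uuid:
        return "new subvolume"
    return "same"
```

A directory walker would call crossing(parent, child) at each step and
decide whether to descend, much as tools compare st_dev today.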

I'm completely on the side of using proper UUIDs.

But I just noticed a problem with this: we still need something like
st_dev for real volume-based snapshots.

One problem with real volume-based snapshots is that the snapshotted
volume is exactly the same filesystem: every bit is identical,
including the UUID.

That means the only way to distinguish such volumes is by st_dev.

A purely UUID-based solution cannot distinguish them by UUID alone,
unless we have some device UUID to replace the old st_dev.
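
To make the collision concrete, a small sketch with made-up values: a
volume and its block-level snapshot carry the same FS UUID but sit on
different devices.  The Volume type, both key functions, the UUID
string and the device numbers are all invented for illustration.

```python
from collections import namedtuple

# fs_uuid as reported by the filesystem, st_dev as (major, minor).
Volume = namedtuple("Volume", ["fs_uuid", "st_dev"])

# A block-level snapshot copies the superblock, so the UUID is identical.
origin = Volume("1b4e28ba-2fa1-11d2-883f-0016d3cca427", (253, 0))
snapshot = Volume("1b4e28ba-2fa1-11d2-883f-0016d3cca427", (253, 1))


def uuid_key(vol):
    """Identity based on the filesystem UUID alone."""
    return vol.fs_uuid


def combined_key(vol):
    """Identity that also includes the device, as st_dev does today."""
    return (vol.fs_uuid, vol.st_dev)
```

Here uuid_key() gives the same value for both volumes, while
combined_key() keeps them apart.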

Thanks,
Qu

We have to keep the st_ino/st_dev thing for backwards compatibility, but
make it easier to get more info out of the file system.

We could in theory expose just the subvolid also, since that's a nice
simple u64, but it limits our ability to do new fancy shit in the
future.  It's not a bad solution, but like I said I think we need to
take a step back and figure out what problem we're specifically trying
to solve, and work from there.  Starting from automounts and working our
way back is not going very well.  Thanks,

Josef