Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.

Josef Bacik <josef@xxxxxxxxxxxxxx> · Mon, 19 Jul 2021 16:44:00 -0400

On 7/19/21 4:00 PM, J. Bruce Fields wrote:
> On Mon, Jul 19, 2021 at 11:40:28AM -0400, Josef Bacik wrote:
>> Ok so setting aside btrfs for the moment, how does NFS deal with
>> exporting a directory that has multiple other file systems under
>> that tree?  I assume the same sort of problem doesn't occur, but why
>> is that?  Is it because it's a different vfsmount/sb or is there
>> some other magic making this work?  Thanks,
>
> There are two main ways an NFS client can look up a file: by name or by
> filehandle.  The former's the normal filesystem directory lookup that
> we're used to.  If the name refers to a mountpoint, the server can cross
> into the mounted filesystem like anyone else.
>
> It's the lookup by filehandle that's interesting.  Typically the
> filehandle includes a UUID and an inode number.  The server looks up the
> UUID with some help from mountd, and that gives a superblock that nfsd
> can use for the inode lookup.
>
> As Neil says, mountd does that basically by searching among mounted
> filesystems for one with that uuid.
>
> So if you wanted to be able to handle a uuid for a filesystem that's not
> even mounted yet, you'd need some new mechanism to look up such uuids.
>
> That's something we don't currently support but that we'd need to
> support if BTRFS subvolumes were automounted.  (And it might have other
> uses as well.)
>
> But I'm not entirely sure if that answers your question....

Right, because btrfs handles the filehandles itself through its
export_operations, and we encode the subvolume ID into each filehandle to make
sure we can always do the proper lookup.
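
To make the idea concrete, here is a rough Python sketch of a filehandle that
carries the subvolume ID next to the inode number.  The field layout, the
helper names (encode_fh, decode_fh) and the example numbers are assumptions
for illustration only; this is not the actual on-the-wire btrfs format.

```python
import struct

# Hypothetical, simplified layout (NOT the real btrfs filehandle format):
# 64-bit inode number, 64-bit subvolume ID, 32-bit generation, little-endian.
_FH_FORMAT = "<QQI"


def encode_fh(ino, subvol_id, generation):
    """Pack the fields the filesystem needs to find the inode again."""
    return struct.pack(_FH_FORMAT, ino, subvol_id, generation)


def decode_fh(fh):
    """Unpack a filehandle produced by encode_fh()."""
    ino, subvol_id, generation = struct.unpack(_FH_FORMAT, fh)
    return {"ino": ino, "subvol_id": subvol_id, "generation": generation}


# Same inode number in two different subvolumes (illustrative values).
fh_root = encode_fh(257, 5, 10)
fh_snap = encode_fh(257, 256, 10)
```

Because the subvolume ID is part of the handle, the two handles differ even
though the inode numbers match, so the lookup is unambiguous.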

I suppose the real problem is that NFS is exposing the inode->i_ino to the 
client without understanding that it's on a different subvolume.

Our trick of simply allocating an anonymous bdev every time you wander into a 
subvolume to get a unique st_dev doesn't help you guys because you are looking 
for mounted file systems.

I'm not concerned about the FH case, because for that it's already been crafted 
by btrfs and we know what to do with it, so it's always going to be correct.

The actual problem is that we can do

getattr(/file1)
getattr(/snap/file1)

on the client and the NFS server just blindly sends i_ino with the same fsid
for both, because / and /snap share an fsid.
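
A toy Python model of that collision, as I understand it.  The Inode tuple,
the EXPORT_FSID value, the helper names and the numbers are all made up for
illustration; nothing here is nfsd or btrfs code.

```python
from collections import namedtuple

# Hypothetical stand-in for an inode on the server.
Inode = namedtuple("Inode", ["path", "subvol_id", "i_ino"])

# Placeholder fsid shared by the whole export, including /snap.
EXPORT_FSID = "example-fsid"


def attrs_as_sent(inode):
    """Identity as the client sees it: the subvolume ID is not included."""
    return (EXPORT_FSID, inode.i_ino)


def attrs_with_subvol(inode):
    """Identity if the subvolume were somehow reflected to the client."""
    return (EXPORT_FSID, inode.subvol_id, inode.i_ino)


file1 = Inode("/file1", 5, 257)
snap_file1 = Inode("/snap/file1", 256, 257)

# Two different files, one identity from the client's point of view.
same = attrs_as_sent(file1) == attrs_as_sent(snap_file1)
distinct = attrs_with_subvol(file1) != attrs_with_subvol(snap_file1)
```

Here `same` is true, which is the bug: the client cannot tell the two files
apart.  `distinct` is true only once the subvolume shows up in what the client
is told, whether through the fsid or some other way.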

Which brings us back to what HCH is complaining about.  In his view, if we had a
vfsmount for /snap then you would know that it was a different fs.  However, that
would only actually work if we generated a completely different superblock and
thus gave /snap a unique fsid, right?

If we did the automount thing, and the NFS server went down, came back up, and
got a getattr(/snap/file1) with a previously generated FH, it would still work,
right?  It would come into the export_operations in the format that btrfs is
expecting, and btrfs would be able to do the lookup.  This FH lookup would do
the automount magic it needs to, and then NFS would have the fsid it needs,
correct?  Thanks,

Josef