Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.

"NeilBrown" <neilb@xxxxxxx> · Tue, 20 Jul 2021 09:53:33 +1000

On Tue, 20 Jul 2021, Josef Bacik wrote:
> On 7/19/21 4:00 PM, J. Bruce Fields wrote:
> > On Mon, Jul 19, 2021 at 11:40:28AM -0400, Josef Bacik wrote:
> >> Ok so setting aside btrfs for the moment, how does NFS deal with
> >> exporting a directory that has multiple other file systems under
> >> that tree?  I assume the same sort of problem doesn't occur, but why
> >> is that?  Is it because it's a different vfsmount/sb or is there
> >> some other magic making this work?  Thanks,
> > 
> > There are two main ways an NFS client can look up a file: by name or by
> > filehandle.  The former's the normal filesystem directory lookup that
> > we're used to.  If the name refers to a mountpoint, the server can cross
> > into the mounted filesystem like anyone else.
> > 
> > It's the lookup by filehandle that's interesting.  Typically the
> > filehandle includes a UUID and an inode number.  The server looks up the
> > UUID with some help from mountd, and that gives a superblock that nfsd
> > can use for the inode lookup.
> > 
> > As Neil says, mountd does that basically by searching among mounted
> > filesystems for one with that uuid.
> > 
> > So if you wanted to be able to handle a uuid for a filesystem that's not
> > even mounted yet, you'd need some new mechanism to look up such uuids.
> > 
> > That's something we don't currently support but that we'd need to
> > support if BTRFS subvolumes were automounted.  (And it might have other
> > uses as well.)
> > 
> > But I'm not entirely sure if that answers your question....
> > 
> 
> Right, because btrfs handles the filehandles ourselves properly with the 
> export_operations and we encode the subvolume id's into those things to make 
> sure we can always do the proper lookup.
> 
> I suppose the real problem is that NFS is exposing the inode->i_ino to the 
> client without understanding that it's on a different subvolume.
> 
> Our trick of simply allocating an anonymous bdev every time you wander into a 
> subvolume to get a unique st_dev doesn't help you guys because you are looking 
> for mounted file systems.
> 
> I'm not concerned about the FH case, because for that it's already been crafted 
> by btrfs and we know what to do with it, so it's always going to be correct.
> 
> The actual problem is that we can do
> 
> getattr(/file1)
> getattr(/snap/file1)
> 
> on the client and the NFS server just blindly sends i_ino with the same fsid 
> because / and /snap are the same fsid.
> 
> Which brings us back to what HCH is complaining about.  In his view if we had a 
> vfsmount for /snap then you would know that it was a different fs.  However that 
> would only actually work if we generated a completely different superblock and 
> thus gave /snap a unique fsid, right?

No, I don't think it needs to be a different superblock to have a
vfsmount.  (I don't know whether it would need one to keep HCH happy.)

If I "mount --bind /snap /snap" then I've created a vfsmnt with the
upper and lower directories identical - same inode, same superblock.
This is an existence-proof that you don't need a separate super-block.
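
As a rough userspace illustration of that point (a sketch, not anything
nfsd does): after such a bind mount, /proc/self/mountinfo should list
two mounts carrying the same major:minor device number.  The sample
text, mount IDs and device number below are made up, and the helper
name mounts_by_device() is hypothetical.

```python
from collections import defaultdict

# Hypothetical /proc/self/mountinfo excerpt after "mount --bind /snap /snap".
# The IDs, device number and options are invented for illustration.
SAMPLE_MOUNTINFO = """\
22 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw
45 22 8:1 /snap /snap rw,relatime shared:1 - ext4 /dev/sda1 rw
"""


def mounts_by_device(mountinfo_text):
    """Group mount points by the major:minor device they expose.

    The device number is used here as a userspace stand-in for
    "same superblock".  Escapes such as \\040 in paths are not decoded.
    """
    groups = defaultdict(list)
    for line in mountinfo_text.splitlines():
        if not line.strip():
            continue
        # Fields before " - ": mount ID, parent ID, major:minor, root,
        # mount point, mount options, then optional tags.
        fields = line.split(" - ", 1)[0].split()
        groups[fields[2]].append(fields[4])
    return dict(groups)
```

Two mount points under one device key is the "separate vfsmnt, same
superblock" situation described above.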

> 
> If we did the automount thing, and the NFS server went down and came back up and 
> got a getattr(/snap/file1) from a previously generated FH it would still work 
> right, because it would come into the export_operations with the format that 
> btrfs is expecting and it would be able to do the lookup.  This FH lookup would 
> do the automount magic it needs to and then NFS would have the fsid it needs, 
> correct?  Thanks,

Not quite.
An NFS filehandle (as generated by linux-nfsd) has two components (plus
a header): the filesystem-part and the file-part.
The filesystem-part is managed by userspace (/usr/sbin/mountd).  The
code relies on every filesystem appearing in /proc/self/mounts.
The bytes chosen are based either on the uuid reported by 'libblkid' or
on the fsid reported by statfs().  The statfs() fsid is used for
filesystems on a black-list of those for which libblkid is not useful.
This list includes btrfs.
The file-part is managed in the kernel using export_operations.
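
To make the two-part structure concrete, here is a toy model.  It is
only a sketch: the two-byte length header is an assumption chosen for
simplicity and is NOT the nfsd wire format, and the names FileHandle,
choose_fs_part() and BLKID_BLACKLIST are hypothetical.

```python
from dataclasses import dataclass

# Filesystems for which the libblkid uuid is not used.  Only btrfs is
# named in this thread; the real list lives in mountd.
BLKID_BLACKLIST = frozenset({"btrfs"})


def choose_fs_part(fstype, blkid_uuid, statfs_fsid):
    """Pick the filesystem-part bytes: blkid uuid unless black-listed."""
    if fstype in BLKID_BLACKLIST:
        return statfs_fsid
    return blkid_uuid


@dataclass(frozen=True)
class FileHandle:
    fs_part: bytes    # chosen in userspace (mountd)
    file_part: bytes  # produced by the filesystem's export_operations

    def encode(self):
        # Toy header: one length byte per component (an assumption).
        if len(self.fs_part) > 255 or len(self.file_part) > 255:
            raise ValueError("component too long for the toy header")
        header = bytes([len(self.fs_part), len(self.file_part)])
        return header + self.fs_part + self.file_part

    @classmethod
    def decode(cls, raw):
        if len(raw) < 2:
            raise ValueError("handle shorter than the toy header")
        fs_len, file_len = raw[0], raw[1]
        if len(raw) != 2 + fs_len + file_len:
            raise ValueError("truncated or oversized handle")
        return cls(raw[2:2 + fs_len], raw[2 + fs_len:])
```

The point of the split is that the two halves are interpreted by
different parties: userspace maps fs_part to a mounted filesystem, and
only then does the filesystem get to look at file_part.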

For any given 'struct path' in the kernel, a filehandle is generated
(conceptually) by finding the closest vfsmnt (close to inode, far from
root) and asking user-space to map that, then passing the inode to the
filesystem and asking it to map that.
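
"Closest vfsmnt" can be pictured as a longest-prefix match over the
mount table.  The kernel gets this from the struct path directly rather
than by comparing strings, so the following is only a userspace sketch;
closest_mount() is a hypothetical helper.

```python
import posixpath


def closest_mount(path, mountpoints):
    """Return the mount point nearest to path (the longest matching prefix)."""
    path = posixpath.normpath(path)
    best = None
    for mp in mountpoints:
        mp = posixpath.normpath(mp)
        # Match on whole path components so "/snap" does not claim
        # "/snapshot/file".
        prefix = mp if mp.endswith("/") else mp + "/"
        if path == mp or path.startswith(prefix):
            if best is None or len(mp) > len(best):
                best = mp
    return best
```

With mount points "/" and "/snap", a lookup of "/snap/file1" selects
"/snap", while "/file1" selects "/".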

So, in your example, if /snap were a mount point, the kernel would ask
mountd to determine the filesystem-part of /snap, and the fact that the
file-part from btrfs contained the objectid for snap would just be
redundant information.  If /snap couldn't be found in /proc/self/mounts
after a server restart, the filehandle would be stale.
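
The failure mode can be sketched as a table lookup that has nothing to
fall back on.  This is an illustration only: the dict stands in for
whatever mountd derives from /proc/self/mounts, and resolve_fs_part()
is a hypothetical name.

```python
import errno


def resolve_fs_part(fs_part, mounted):
    """Map filesystem-part bytes to a mount point, or fail with ESTALE.

    mounted is a dict of fs-part bytes -> mount point, standing in for
    the set of filesystems currently visible in the mount table.
    """
    try:
        return mounted[fs_part]
    except KeyError:
        # Nothing mounted matches, so the file-part is never consulted.
        raise OSError(errno.ESTALE, "stale file handle") from None
```

Note that the file-part never gets a chance to help: even if it encodes
enough to find the subvolume, the lookup fails before the filesystem is
asked.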

If btrfs were to use automounts and create the vfsmnts that one might
normally expect, then nfsd would need there to be two different sorts of
mount points, ideally visible in /proc/mounts (maybe a new flag that
appears in the list of mount options? "internal" ??).

- there needs to be the current mountpoint which is expected to be
  present after a reboot, and is likely to introduce a new filesystem,
  and
- there are these "new" mountpoints which are on-demand and expose
  something that is (in some sense) part of the same filesystem.
  The key property that NFSd would depend on is that these mount points
  do NOT introduce a new name-space for file-handles (in the sense of
  export_operations).

To expand on that last point:
- If a filehandle is requested for an inode above the "new" mountpoint
  and another "below" the new mountpoint, they are guaranteed to be
  different.
- If a filehandle that was "below" the new mountpoint is passed to
  exportfs_decode_fh() together with the vfsmnt that was *above* the
  mountpoint, then it somehow does "the right thing".  Probably
  that would require changing exportfs_decode_fh() to return a
  'struct path' rather than just a 'struct dentry *'.

When nfsd detected one of these "internal" mountpoints during a lookup,
it would *not* call-out to user-space to create a new export, but it
*would* ensure that a new fsid was reported for all inodes in the new
vfsmnt.
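
A minimal sketch of that decision, under loud assumptions: the
"internal" flag does not exist today, the way the new fsid is derived
here (parent export id plus a subvolume id) is purely a placeholder,
and Mount, fsid_after_crossing() and the upcall callback are all
hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mount:
    path: str
    internal: bool      # hypothetical "internal" mount flag
    subvol_id: int = 0  # placeholder identifier for the subtree


def fsid_after_crossing(mnt, parent_fsid, upcall):
    """Return the fsid to report for inodes under mnt.

    An fsid is modelled as (export_id, subvol_id).  upcall(path) stands
    in for asking user-space to create a new export.
    """
    if mnt.internal:
        # No call-out to user-space: stay within the parent's export,
        # but report a distinct fsid for everything in this vfsmnt.
        return (parent_fsid[0], mnt.subvol_id)
    # Ordinary mount point: user-space decides, as it does today.
    return (upcall(mnt.path), 0)
```

With distinct fsids, getattr(/file1) and getattr(/snap/file1) can
return the same i_ino without the client confusing the two files.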

NeilBrown