Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.

On 7/15/21 12:37 AM, NeilBrown wrote:

Hi all,
  the problem this patch addresses has been discussed on both the NFS list
  and the BTRFS list, so I'm sending this to both.  I'd be very happy for
  people experiencing the problem (NFS export of BTRFS subvols) who are
  in a position to rebuild the kernel on their NFS server to test this
  and report success (or otherwise).

  While I've tried to write this patch so that it *could* land upstream
  (and could definitely land in a distro franken-kernel if needed), I'm
  not completely sure it *should* land upstream.  It includes some deep
  knowledge of BTRFS into NFSD code.  This could be removed later once
  proper APIs are designed and provided.  I can see arguments either way
  and wonder what others think.

  BTRFS developers:  please examine the various claims I have made about
    BTRFS and correct any that are wrong.  The observation that
    getdents can report the same inode number for unrelated files
    (a file and a subvol in my case) is ... interesting.

  NFSD developers: please comment on anything else.

  Others: as I said: testing would be great! :-)

Subject: [PATCH] NFSD: handle BTRFS subvolumes better.

A single BTRFS mount can present as multiple "volumes".  i.e. multiple
sets of objects with potentially overlapping inode number spaces.
The st_dev presented to user-space via the stat(2) family of calls is
different for each internal volume, as is the f_fsid reported by
statfs().

However nfsd doesn't look at st_dev or the fsid (other than for the
export point - typically the mount point), so it doesn't notice the
different filesystems.  Importantly, it doesn't report a different fsid
to the NFS client.

This leads to the NFS client reusing inode numbers, and applications
like "find" and "du" complaining, particularly when they find a
directory with the same st_ino and st_dev as an ancestor.  This
typically happens with the root of a sub-volume as the root of every
volume in BTRFS has the same inode number (256).
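As a rough illustration of the complaint (not code from the patch, and not the actual find/du implementation), tools of this kind detect loops by checking whether a directory's (st_dev, st_ino) pair has already been seen among its ancestors. The device and inode values below are made up for the example; only the root inode number 256 comes from the description above.

```python
from typing import Iterable, Optional, Tuple

FileId = Tuple[int, int]  # (st_dev, st_ino)


def find_ancestor_loop(chain: Iterable[FileId]) -> Optional[int]:
    """Return the depth of the first directory whose (st_dev, st_ino)
    matches one of its ancestors, or None if no such directory exists."""
    seen = set()
    for depth, ident in enumerate(chain):
        if ident in seen:
            return depth
        seen.add(ident)
    return None


# Hypothetical path: volume root -> ordinary directory -> subvolume root.
# Locally, the subvolume has its own st_dev, so the pairs stay distinct.
local_view = [(0x30, 256), (0x30, 257), (0x31, 256)]
# Over NFS with a single fsid, the client sees one st_dev for everything.
nfs_view = [(0x40, 256), (0x40, 257), (0x40, 256)]

print(find_ancestor_loop(local_view))  # None
print(find_ancestor_loop(nfs_view))    # 2
```

With a single st_dev the subvolume root at depth 2 is indistinguishable from the volume root, which is the situation that makes traversal tools report a loop.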

To fix this, we need to report a different fsid for each subvolume, but
need to use the same fsid that we currently use for the top-level
volume.  Changing this (by rebooting a server to new code) might
confuse the client.  I don't think it would be a major problem (stale
filehandles shouldn't happen), but it is best avoided.

Determining the fsid to use is a bit awkward....

There is limited space in the protocol (32 bits for NFSv3, 64 for NFSv4)
so we cannot append the subvolume fsid.  The best option seems to be to
hash it in.  This patch uses a simple 'xor', but possibly a Jenkins hash
would be better.
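A minimal sketch of what "hashing in" by xor might look like when the identifier is wider than the field, assuming the id is folded down to the field width first. The helper names fold_xor and mix_fsid, and the sample values, are hypothetical and are not taken from the patch.

```python
def fold_xor(value: int, bits: int) -> int:
    """Fold a non-negative integer down to `bits` bits by XOR-ing
    successive `bits`-wide chunks together."""
    mask = (1 << bits) - 1
    folded = 0
    while value:
        folded ^= value & mask
        value >>= bits
    return folded


def mix_fsid(base_fsid: int, subvol_id: int, bits: int) -> int:
    """XOR a (folded) subvolume id into a base fsid of the given width.
    Illustrative only: the patch's real arithmetic may differ."""
    mask = (1 << bits) - 1
    return (base_fsid & mask) ^ fold_xor(subvol_id, bits)


base = 0xDEADBEEF  # made-up base fsid
print(hex(mix_fsid(base, 0, 32)))               # 0xdeadbeef
print(hex(mix_fsid(base, 0x1_0000_0101, 32)))   # 0xdeadbfef
```

Folding by xor is cheap but not collision-free: in this sketch the ids 0x100 and 0x1_0000_0101 map to the same 32-bit result, which is one reason a stronger mixing function might be preferred.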

For BTRFS (and other) filesystems the current fsid is a hash (xor) of
the uuid provided from userspace by mountd.  This is derived from the
statfs fsid.  If we use the statfs fsid for subvolumes and xor this in,
we risk erasing useful unique information.  So I have chosen not to use
the statfs fsid.

Ideally we should have an API for the filesystem to report if it uses
multiple subvolumes, and to provide a unique identifier for each.  For
now, this patch calls exportfs_encode_fh().  If the returned fsid type
is NOT one of those used by BTRFS, then we assume the st_fsid cannot
change, and use the current behaviour.

If the type IS one that BTRFS uses, we use intimate knowledge of BTRFS
to extract the root_object_id from the filehandle and record that with
the export information.  Then when exporting an fsid, we check if
subvolumes are enabled and if the current dentry has a different
root_object_id to the exported volume.  If it does, the root_object_id
is hashed (xor) into the reported fsid.
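The decision described above can be sketched as follows. This is an illustration in Python rather than the kernel C of the patch; the Export record and its field names are hypothetical stand-ins for whatever the export information actually stores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Export:
    """Hypothetical stand-in for per-export state."""
    fsid: int            # fsid currently reported for the export point
    subvolumes: bool     # True if the filehandle type looked like BTRFS
    root_objectid: int   # root_object_id recorded for the export point


def reported_fsid(export: Export, dentry_root_objectid: int) -> int:
    """Return the fsid to report for an object under `export`.

    The fsid only changes when subvolume handling is enabled and the
    object lives in a different subvolume from the export point, so the
    top-level volume keeps the fsid it has today."""
    if export.subvolumes and dentry_root_objectid != export.root_objectid:
        return export.fsid ^ dentry_root_objectid
    return export.fsid


exp = Export(fsid=0xABCD0000, subvolumes=True, root_objectid=5)
print(hex(reported_fsid(exp, 5)))    # 0xabcd0000 (same subvolume)
print(hex(reported_fsid(exp, 258)))  # 0xabcd0102 (different subvolume)
```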

When an NFSv4 client sees that the fsid has changed, it will ask for the
MOUNTED_ON_FILEID.  With the Linux NFS client, this is visible to
userspace as an automount point, until content within the directory is
accessed and the automount is triggered.  Currently the MOUNTED_ON_FILEID
for these subvolume roots is the same as that of the root - 256.  This will
cause find et al. to complain until the automount actually gets mounted.

So this patch reports the MOUNTED_ON_FILEID in such cases to be a magic
number that appears to be appropriate for BTRFS:
     BTRFS_FIRST_FREE_OBJECTID - 1

Again, we really want an API to get this from the filesystem.  Changing
it later has no cost, so we don't need any commitment from the btrfs team
that this is what they will provide if/when we do get such an API.

This same problem (of an automount point with a duplicate inode number)
also exists for NFSv3.  This problem cannot be resolved completely on
the server as NFSv3 doesn't have a well defined "MOUNTED_ON_FILEID"
concept, but we can come close.  The inode number returned by READDIR is
likely to be the mounted-on-fileid.  With READDIR_PLUS, two fileids are
returned, the one from the readdir, and (optionally) another from
'stat'.  Linux-NFS checks these match and if not, it treats the first as
a mounted-on-fileid.
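The client-side check described above might be modelled like this. It is a simplified sketch, not the Linux NFS client code; the Entry type and its field names are invented for the example.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    """One READDIR_PLUS entry, reduced to the two fileids of interest."""
    name: str
    readdir_fileid: int          # fileid carried in the directory entry
    attr_fileid: Optional[int]   # fileid from the optional attributes


def mounted_on_fileid(entry: Entry) -> Optional[int]:
    """Return the fileid to treat as a mounted-on-fileid, or None when
    the entry gives no sign of being a mount point."""
    if entry.attr_fileid is None:
        return None  # no attributes were returned, nothing to compare
    if entry.attr_fileid != entry.readdir_fileid:
        return entry.readdir_fileid
    return None


print(mounted_on_fileid(Entry("file", 300, 300)))    # None
print(mounted_on_fileid(Entry("subvol", 259, 256)))  # 259
```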

Interestingly BTRFS getdents() *DOES* report a different inode number
for subvol roots than is returned by stat().  These aren't actually
unique (!!!!) but in at least one case, they are different from
ancestors, so this is sufficient.

NFSD currently SUPPRESSES the stat information if the inode number is
different.  This is because there is room for a file to be renamed between
the readdir call and the lookup_one_len() prior to getattr, and the
results could be confusing.  However for the case of a BTRFS filesystem
with an inode number of 256, the value of reporting the difference seems
to exceed the cost of any confusion caused by a race (if that is even
possible in this case).
So this patch allows the two fileids to be different when 256 is found
on BTRFS.
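A sketch of the server-side rule, under the assumption that "256 is found" refers to the inode number returned by stat for the entry. The function name and parameters are hypothetical, and the real check in NFSD is C code with more context than shown here.

```python
BTRFS_SUBVOL_ROOT_INO = 256  # inode number of every BTRFS volume root


def include_attrs(readdir_ino: int, stat_ino: int, is_btrfs: bool) -> bool:
    """Decide whether to return stat attributes for a READDIR_PLUS entry.

    Attributes are normally suppressed when the two inode numbers differ,
    since the name may have been renamed between readdir and lookup.
    Assumed exception: on BTRFS, a stat inode of 256 marks a subvolume
    root, and reporting the mismatch is what lets the client detect it."""
    if readdir_ino == stat_ino:
        return True
    return is_btrfs and stat_ino == BTRFS_SUBVOL_ROOT_INO


print(include_attrs(300, 300, False))  # True
print(include_attrs(300, 301, False))  # False
print(include_attrs(259, 256, True))   # True
```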

With this patch a 'du' or 'find' in an NFS-mounted btrfs filesystem
which has snapshot subvols works correctly for both NFSv4 and NFSv3.
Fortunately the problematic programs tend to trigger READDIR_PLUS and so
benefit from the detection of the MOUNTED_ON_FILEID which it provides.

Signed-off-by: NeilBrown <neilb@xxxxxxx>

I'm going to restate what I think the problem is you're having just so I'm sure we're on the same page.

1. We export a btrfs volume via nfsd that has multiple subvolumes.
2. We run find, and when we stat a file, nfsd doesn't send along our bogus st_dev, it sends its own thing (I assume?). This confuses du/find because you get the same inode number with different parents.

Is this correct? If that's the case then it'd be relatively straightforward to add another callback into export_operations to grab this fsid, right? Hell, we could simply return the objectid of the root since that's unique across the entire file system. We already do our magic FH encoding to make sure we keep all this straight for NFS, another callback to give that info isn't going to kill us. Thanks,

Josef


