Re: [PATCH] VFS/BTRFS/NFSD: provide more unique inode number for btrfs export

On 8/15/21 9:35 PM, Roman Mamedov wrote:
On Sun, 15 Aug 2021 09:39:08 +0200
Goffredo Baroncelli <kreijack@xxxxxxxxx> wrote:

I am sure that it was discussed already, but I was unable to find any track
of this discussion. But if the problem is the collision between the inode
numbers of different subvolumes in the nfsd export, would it be simpler if the
export were truncated at the subvolume boundary? That would be more coherent
with the current behavior of vfs+nfsd.

See this bugreport thread which started it all:
https://www.spinics.net/lists/linux-btrfs/msg111172.html

In there the reporting user replied that it is strongly not feasible for them
to export each individual snapshot.

Thanks for pointing that out.

However, looking at the 'exports' man page, it seems that NFS already has an
option to cover these cases: 'crossmnt'.

If NFSd detects a "child" filesystem (i.e. a filesystem mounted inside an already
exported one) and the "parent" filesystem is marked 'crossmnt', the client mounts
the parent AND the child filesystem as two separate mounts, so there is no problem of inode collision.

I tested it by mounting two nested ext4 filesystems, and there is no inode collision
problem (even though there are two different files with the same inode number).

---------
# mount -o loop disk2 test3/
# echo 123 >test3/one
# mkdir test3/test4
# sudo mount -o loop disk3 test3/test4/
# echo 123 >test3/test4/one
# ls -liR test3/
test3/:
total 24
11 drwx------ 2 root  root  16384 Aug 15 22:27 lost+found
12 -rw-r--r-- 1 ghigo ghigo     4 Aug 15 22:29 one
 2 drwxr-xrwx 3 root  root   4096 Aug 15 22:46 test4

test3/test4:
total 20
11 drwx------ 2 root  root  16384 Aug 15 22:45 lost+found
12 -rw-r--r-- 1 ghigo ghigo     4 Aug 15 22:46 one

# egrep test3 /etc/exports
/tmp/test3 *(rw,no_subtree_check,crossmnt)

# mount 192.168.1.27:/tmp/test3 3
# ls -lRi 3
3:
total 24
11 drwx------ 2 root  root  16384 Aug 15 22:27 lost+found
12 -rw-r--r-- 1 ghigo ghigo     4 Aug 15 22:29 one
 2 drwxr-xrwx 3 root  root   4096 Aug 15 22:46 test4

3/test4:
total 20
11 drwx------ 2 root  root  16384 Aug 15 22:45 lost+found
12 -rw-r--r-- 1 ghigo ghigo     4 Aug 15 22:46 one

# mount | egrep 192
192.168.1.27:/tmp/test3 on /tmp/3 type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.27,local_lock=none,addr=192.168.1.27)
192.168.1.27:/tmp/test3/test4 on /tmp/3/test4 type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.27,local_lock=none,addr=192.168.1.27)


---------------
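The reason the two files named 'one' above do not clash, despite both having inode 12, is that userspace identifies a file by the (st_dev, st_ino) pair rather than by st_ino alone, and each NFS mount gets its own device number. A minimal sketch of that check (the helper name and paths are made up for illustration):

```python
import os

def same_file(path_a: str, path_b: str) -> bool:
    """Two paths name the same file only if BOTH the device number
    and the inode number match."""
    a = os.stat(path_a)
    b = os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
```

Run against the mounts above, same_file('3/one', '3/test4/one') comes out False even though both files report inode 12, because st_dev differs across the two mounts.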

I even tried mounting with "nfsvers=3", and it seems to work.

However, the tests above are related to ext4; in fact it doesn't work with btrfs, but I think this is more an implementation problem than a strategy problem.
What I mean is that NFS already has a way to mount different parts of the fs-tree as different mounts (from a client POV). I think this strategy should be used when NFSd exports a BTRFS filesystem:
- if 'crossmnt' is NOT passed, the export should end at the subvolume boundary (or allow inode collisions)
- if 'crossmnt' is passed, the client should automatically mount each nested subvolume as a separate mount


In fact, in btrfs a subvolume is a complete filesystem, with its own
"synthetic" device. We may like this solution or not, but it is the one
most aligned with the Unix standard, where each filesystem has its own
pair (device, inode-set). NFS (by default) avoids crossing the boundary
between filesystems. So why should BTRFS be different?

From the user point of view subvolumes are basically directories; that they
are "complete filesystems"* is merely a low-level implementation detail.

* Well, except they are not, as you cannot 'dd' a subvolume to another
block device.

Why not rename "ino_uniquifier" to "ino_and_subvolume" and leave to the
filesystem the work of combining the inode and the subvolume-id?

I am worried that the logic is split between the filesystem, which
synthesizes the ino_uniquifier, and NFS, which combines it with the inode. I
think this combination is filesystem specific: for BTRFS it is a simple
xor, but for other filesystems it may be a more complex operation, so leaving
one half in the filesystem and the other half in NFS seems suboptimal if
other filesystems need to use ino_uniquifier.
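To make the concern concrete, the BTRFS-side combination being discussed amounts to something like the sketch below; the 64-bit result width matches st_ino, but the shift amount and function name are assumptions for illustration, not the actual patch:

```python
# Hypothetical bit position for the subvolume id; the real patch
# may fold the two values together differently.
SUBVOL_SHIFT = 40

def combine_ino(ino: int, subvol_id: int) -> int:
    """Fold the subvolume id into the inode number with xor,
    truncated to 64 bits as st_ino would be."""
    return (ino ^ (subvol_id << SUBVOL_SHIFT)) & ((1 << 64) - 1)
```

The point of the objection above is that this xor is a btrfs-specific choice; another filesystem might need a different combination, which is why keeping the whole operation inside the filesystem (rather than half in NFSD) seems cleaner.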

I wondered a bit myself: what are the downsides of just doing the
uniquification inside Btrfs, not leaving that to NFSD?

I mean, not even adding the extra stat field, just returning the inode itself
with that already applied. Surely it cannot be any worse collision-wise than
different subvolumes straight up having the same inode numbers, as right now?

Or is it a performance concern, always doing more work for something which
only NFSD has needed so far?



--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


