Re: [PATCH] VFS/BTRFS/NFSD: provide more unique inode number for btrfs export

Goffredo Baroncelli <kreijack@xxxxxxxxx> · Wed, 18 Aug 2021 19:24:46 +0200

On 8/17/21 11:39 PM, NeilBrown wrote:
On Wed, 18 Aug 2021, kreijack@xxxxxxxxx wrote:
On 8/15/21 11:53 PM, NeilBrown wrote:
On Mon, 16 Aug 2021, kreijack@xxxxxxxxx wrote:
On 8/15/21 9:35 PM, Roman Mamedov wrote:

However, looking at the 'exports' man page, it seems that NFS already has an
option to cover these cases: 'crossmnt'.

If NFSd detects a "child" filesystem (i.e. a filesystem mounted inside an already
exported one) and the "parent" filesystem is marked as 'crossmnt', the client mounts
the parent AND the child filesystem as two separate mounts, so there is no problem of inode collision.

As you acknowledged, you haven't read the whole back-story.  Maybe you
should.

https://lore.kernel.org/linux-nfs/20210613115313.BC59.409509F4@xxxxxxxxxxxx/
https://lore.kernel.org/linux-nfs/162848123483.25823.15844774651164477866.stgit@noble.brown/
https://lore.kernel.org/linux-btrfs/162742539595.32498.13687924366155737575.stgit@noble.brown/

The flow of conversation does sometimes jump between threads.

I'm very happy to respond to your questions after you've absorbed all that.

Hi Neil,

I read the other threads.  And I still have the opinion that the nfsd
crossmnt behavior should be a good solution for the btrfs subvolumes.

Thanks for reading it all.  Let me join the dots for you.

[...]

Alternately we could change the "crossmnt" functionality to treat a
change of st_dev as though it were a mount point.  I posted patches to
do this too.  This hits the same sort of problems in a different way.
If NFSD reports that it has crossed a "mount" by providing a different
filesystem-id to the client, then the client will create a new mount
point which will appear in /proc/mounts.  

Yes, this is my proposal.

It might be less likely that
many thousands of subvolumes are accessed over NFS than locally, but it
is still entirely possible.  

I don't think that it would be so unlikely. Think about a file indexer
and/or a 'find' command run in the folder that contains the snapshots...

I don't want the NFS client to suffer a
problem that btrfs doesn't impose locally.  

The solution is not easy. In fact we are trying to map a u64 x u64 space to a u64 space. The truth is that we
cannot guarantee that a collision will not happen. We can only say that for a fresh filesystem it is nearly
impossible, while for an aged filesystem it is unlikely but possible.

We have already faced a real case where the inode space was exhausted on 32-bit architectures. What are the chances that the number of subvolumes ever created is greater than 2^24 and the inode number is greater than 2^40? The likelihood is low, but not 0...
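
To make the 2^24 / 2^40 figures concrete, here is a minimal userspace sketch. It assumes a hypothetical fixed split (subvolume id in the top 24 bits, inode number in the low 40 bits); pack_ino(), pack_ino_demo() and INO_BITS are illustrative names only and are not taken from the patch, which may combine the two values differently.

```c
#include <assert.h>
#include <stdint.h>

#define INO_BITS 40	/* assumed split: low 40 bits hold the inode number */

/* Hypothetical packing: subvolume id in the high 24 bits, inode in the low 40. */
uint64_t pack_ino(uint64_t subvol, uint64_t ino)
{
	return (subvol << INO_BITS) | ino;
}

/* Shows one collision-free case and one collision. */
void pack_ino_demo(void)
{
	/* Inside the bit budget: same inode, different subvolumes, distinct results. */
	assert(pack_ino(5, 257) != pack_ino(6, 257));

	/*
	 * Outside the budget: inode 2^40 in subvolume 4 spills into bit 40
	 * and yields the same value as inode 0 in subvolume 5.
	 */
	assert(pack_ino(4, (uint64_t)1 << INO_BITS) == pack_ino(5, 0));
}
```

As long as both values stay inside their share of the 64 bits the mapping is injective; once either one outgrows it, two different (subvolume, inode) pairs can end up with the same number.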

Some random thoughts:
- the new inode numbers are created by merging the original inode number (in the lower bits) and the object-id of the subvolume (in the higher bits). We could add a warning when these bits overlap:

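	/* warn when the highest inode bit reaches the lowest uniquifier bit */
	if (stat->ino_uniquifer &&
	    fls64(stat->ino) > __ffs64(stat->ino_uniquifer))
		pr_warn("NFSD: Warning possible inode collision...\n");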

A smarter heuristic could be developed, such as checking the maximum inode number against the maximum subvolume id once at mount time....

- reusing inode numbers is an expensive operation (even though it exists/existed for 32-bit processors), but we could reuse the object-id after it is freed

- I think that we could add an option to nfsd or btrfs (not a default behavior) to avoid crossing the subvolume boundary
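
The overlap warning from the first point can be tried out in userspace. The sketch below is only an illustration under stated assumptions: highest_bit(), lowest_bit() and ino_may_collide() are hypothetical helper names, and the uniquifier is assumed to be a plain u64 whose set bits sit above the inode bits. In the kernel one would presumably use the existing bitops helpers rather than these loops.

```c
#include <stdbool.h>
#include <stdint.h>

/* 1-based position of the highest set bit, 0 if x == 0. */
int highest_bit(uint64_t x)
{
	int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* 1-based position of the lowest set bit, 0 if x == 0. */
int lowest_bit(uint64_t x)
{
	int n = 1;

	if (!x)
		return 0;
	while (!(x & 1)) {
		n++;
		x >>= 1;
	}
	return n;
}

/* True when the inode bits reach into the bits used by the uniquifier. */
bool ino_may_collide(uint64_t ino, uint64_t uniquifier)
{
	/* A zero uniquifier leaves the inode number untouched: nothing to warn about. */
	return uniquifier != 0 && highest_bit(ino) >= lowest_bit(uniquifier);
}
```

For example, with a uniquifier whose lowest set bit is bit 40, inode 257 is reported as safe, while inode 2^40 triggers the warning. The check is conservative: overlapping bit ranges do not prove that a collision exists, only that one can no longer be ruled out.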

And 'private' subvolumes
could again appear on a public list if they were accessed via NFS.

(Wrongly) I never considered such a scenario. However, I think that these could be anonymized using an alias (the name of the path to mount is passed by nfsd, so it could create an alias that nfsd will recognize when the client requests it... complex but doable...)

Thanks,
NeilBrown

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5