Re: [PATCH/RFC 00/11] expose btrfs subvols in mount table correctly

Forza <forza@xxxxxxxxxxxx> · Fri, 30 Jul 2021 18:25:06 +0200

On 2021-07-30 17:48, Josef Bacik wrote:
On 7/30/21 11:17 AM, J. Bruce Fields wrote:
On Fri, Jul 30, 2021 at 02:23:44PM +0800, Qu Wenruo wrote:
OK, forgot it's an opt-in feature, then it's less an impact.

But it can still sometimes be problematic.

E.g. if the user wants to put some git code into one subvolume, while
exporting another subvolume through NFS.

Then the user has to opt in, causing the git subvolume to lose the
ability to determine the subvolume boundary, right?

Totally naive question: would it be possible to treat different subvolumes
differently, and give the user some choice at subvolume creation time
how this new boundary should behave?

It seems like there are some conflicting priorities that can only be
resolved by someone who knows the intended use case.

This is the crux of the problem.  We have no real interfaces or anything 
to deal with this sort of paradigm.  We do the st_dev thing because 
that's the most common way that tools like find or rsync use to 
determine they've wandered into a "different" volume.  This exists 
specifically because of use cases like Zygo's, where he's taking 
thousands of snapshots and manually excluding them from find/rsync is 
just not reasonable.
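
As a rough illustration of the st_dev check described above, here is a minimal Python sketch of how a tree walker might stay on one "volume" (similar in spirit to find -xdev or rsync -x, not their actual implementation). The function names are made up for this example.

```python
import os


def crosses_boundary(parent, child):
    """Return True if child reports a different st_dev than parent."""
    return os.lstat(parent).st_dev != os.lstat(child).st_dev


def walk_one_filesystem(root):
    """Yield directories under root, pruning any whose st_dev differs.

    On Btrfs today each subvolume reports its own st_dev, so this
    walker would skip nested subvolumes and snapshots.
    """
    root_dev = os.lstat(root).st_dev
    for dirpath, dirnames, _filenames in os.walk(root):
        # Editing dirnames in place tells os.walk not to descend further.
        dirnames[:] = [
            d for d in dirnames
            if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev
        ]
        yield dirpath
```

With thousands of snapshots under one tree, this single comparison is what saves the user from listing every snapshot as an explicit exclude.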

We have no good way to give the user information about what's going on, 
we just have these old shitty interfaces.  I asked our guys about 
filling up /proc/self/mountinfo with our subvolumes and they had a heart 
attack because we have around 2-4k subvolumes on machines, and with 
monitoring stuff in place we regularly read /proc/self/mountinfo to 
determine what's mounted and such.
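
To make the monitoring cost concrete, here is a small sketch of the kind of parsing such tooling does on every read of /proc/self/mountinfo. The field layout follows proc(5); the function name and the returned dictionary keys are just choices for this example.

```python
def parse_mountinfo_line(line):
    """Split one /proc/self/mountinfo line into its main fields.

    Layout per proc(5): mount ID, parent ID, major:minor, root,
    mount point, mount options, zero or more optional fields,
    a "-" separator, filesystem type, mount source, super options.
    """
    left, _, right = line.strip().partition(" - ")
    head = left.split()
    tail = right.split()
    return {
        "mount_id": int(head[0]),
        "parent_id": int(head[1]),
        "dev": head[2],
        "root": head[3],
        "mount_point": head[4],
        "options": head[5],
        "fstype": tail[0],
        "source": tail[1],
        "super_options": tail[2] if len(tail) > 2 else "",
    }


def read_mountinfo(path="/proc/self/mountinfo"):
    """Parse every entry; the work grows linearly with the mount count."""
    with open(path) as f:
        return [parse_mountinfo_line(line) for line in f if line.strip()]
```

If every subvolume became a mount table entry, a machine with 2-4k subvolumes would hand each such reader thousands of extra lines per poll.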

And then there's NFS which needs to know that it's walked into a new 
inode space.

This is all super shitty, and mostly exists because we don't have a good 
way to expose to the user wtf is going on.

Personally I would be ok with simply disallowing NFS to wander into 
subvolumes from an exported fs.  If you want to export subvolumes then 
export them individually, otherwise if you walk into a subvolume from 
NFS you simply get an empty directory.

This doesn't solve the mountinfo problem where a user may want to figure 
out which subvol they're in, but this is where I think we could address 
the issue with better interfaces.  Or perhaps Neil's idea to have a 
common major number with a different minor number for every subvol.
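
For illustration only, this is how a userspace tool could interpret such a scheme if it existed. The helper names are hypothetical, and the "same major, different minor means same filesystem but different subvolume" rule is an assumption about the proposal, not something the kernel guarantees today.

```python
import os


def dev_pair(path):
    """Return (major, minor) of the device reported for path."""
    st = os.lstat(path)
    return os.major(st.st_dev), os.minor(st.st_dev)


def looks_like_sibling_subvol(dev_a, dev_b):
    """Assumed rule: equal major and different minor would indicate
    two subvolumes of one filesystem under the proposed numbering."""
    return (os.major(dev_a) == os.major(dev_b)
            and os.minor(dev_a) != os.minor(dev_b))
```

Existing tools that only compare st_dev for equality would keep working unchanged, while newer tools could look at the major number to learn that two paths share a filesystem.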

Either way this isn't as simple as shoehorning it into automount and 
being done with it, we need to take a step back and think about how 
this should actually look, taking into account we've got 12 years of 
having Btrfs deployed with existing use cases that expect a certain 
behavior.  Thanks,

Josef

As a user and sysadmin I really appreciate the way Btrfs currently works.

We use hourly snapshots which are exposed over Samba as "Previous 
Versions" to Windows users. This amounts to thousands of snapshots, all 
user serviceable. A great feature!

In the Samba world we have a mount option[1] called "noserverino" which 
lets the client generate unique inode numbers, rather than using the 
server-provided inode numbers. This allows Linux clients to work well 
against servers exposing subvolumes and snapshots.

NFS has really old roots and had to make choices that we don't really 
have to make today. Could we not provide something similar to mount.cifs 
that generates unique inode numbers for the clients? This could be either 
an nfsd export option (such as /mnt/foo *(rw,uniq_inodes)) or a mount 
option on the clients.

One worry I have with making subvolumes automount points is that it might 
break the ability to cp --reflink across that boundary.
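
A quick way to check this on a test machine would be to attempt the clone directly and look at the error. The sketch below uses the FICLONE ioctl, which is what cp --reflink relies on. The request number is hard-coded for common architectures such as x86 and arm (an assumption, it differs on some others), and try_reflink is a name made up for this example.

```python
import errno
import fcntl

# _IOW(0x94, 9, int) as encoded on x86/arm; assumed, not portable everywhere.
FICLONE = 0x40049409


def try_reflink(src, dst):
    """Try to clone src into dst.

    Returns None on success, or the symbolic errno name on failure
    (for example "EXDEV" if the kernel refuses to clone across a
    mount boundary, or "EOPNOTSUPP" on a filesystem without reflinks).
    On failure dst is left behind as an empty file.
    """
    with open(src, "rb") as s, open(dst, "wb") as d:
        try:
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
        except OSError as e:
            return errno.errorcode.get(e.errno, str(e.errno))
    return None
```

Running it with src in one subvolume and dst in another, before and after such a change, would show whether the boundary starts returning an error.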

[1] https://www.samba.org/~ab/output/htmldocs/manpages-3/mount.cifs.8.html