Re: [PATCH/RFC 00/11] expose btrfs subvols in mount table correctly

Amir Goldstein <amir73il@xxxxxxxxx> · Fri, 30 Jul 2021 08:28:05 +0300

On Fri, Jul 30, 2021 at 5:41 AM NeilBrown <neilb@xxxxxxx> wrote:
>
>
> I've been pondering all the excellent feedback, and what I have learnt
> from examining the code in btrfs, and I have developed a different
> perspective.
>
> Maybe "subvol" is a poor choice of name because it conjures up
> connections with the Volumes in LVM, and btrfs subvols are very different
> things.  Btrfs subvols are really just subtrees that can be treated as a
> unit for operations like "clone" or "destroy".
>
> As such, they don't really deserve separate st_dev numbers.
>
> Maybe the different st_dev numbers were introduced as a "cheap" way to
> extend to size of the inode-number space.  Like many "cheap" things, it
> has hidden costs.
>
> Maybe objects in different subvols should still be given different inode
> numbers.  This would be problematic on 32bit systems, but much less so on
> 64bit systems.
>
> The patch below, which is just a proof-of-concept, changes btrfs to
> report a uniform st_dev, and different (64bit) st_ino in different subvols.
>
> It has problems:
>  - it will break any 32bit readdir and 32bit stat.  I don't know how big
>    a problem that is these days (ino_t in the kernel is "unsigned long",
>    not "unsigned long long). That surprised me).
>  - It might break some user-space expectations.  One thing I have learnt
>    is not to make any assumption about what other people might expect.
>
> However, it would be quite easy to make this opt-in (or opt-out) with a
> mount option, so that people who need the current inode numbers and will
> accept the current breakage can keep working.
>
> I think this approach would be a net-win for NFS export, whether BTRFS
> supports it directly or not.  I might post a patch which modifies NFS to
> intuit improved inode numbers for btrfsdemostrates exports....
>
> So: how would this break your use-case??

The simple cases are find -xdev and du -x which expect the st_dev
change, but that can be excused if opting in to a unified st_dev namespace.

The harder problem is <st_dev;st_ino> collisions which are not even
that hard to hit with unlimited number of snapshots.
The 'diff' tool demonstrates the implications of <st_dev;st_ino>
collisions for different objects on userspace.
See xfstest overlay/049 for a demonstration.

The overlayfs xino feature made a similar change to overlayfs
<st_dev;st_ino> with one big difference - applications expect that
all objects in overlayfs mount will have the same st_dev.

Also, overlayfs has prior knowledge on the number of layers
so it is easier to parcel the ino namespace and avoid collisions.

Thanks,
Amir.