On Fri, Jul 30, 2021 at 5:41 AM NeilBrown <neilb@xxxxxxx> wrote:
>
>
> I've been pondering all the excellent feedback, and what I have learnt
> from examining the code in btrfs, and I have developed a different
> perspective.
>
> Maybe "subvol" is a poor choice of name because it conjures up
> connections with the Volumes in LVM, and btrfs subvols are very different
> things. Btrfs subvols are really just subtrees that can be treated as a
> unit for operations like "clone" or "destroy".
>
> As such, they don't really deserve separate st_dev numbers.
>
> Maybe the different st_dev numbers were introduced as a "cheap" way to
> extend the size of the inode-number space. Like many "cheap" things, it
> has hidden costs.
>
> Maybe objects in different subvols should still be given different inode
> numbers. This would be problematic on 32bit systems, but much less so on
> 64bit systems.
>
> The patch below, which is just a proof-of-concept, changes btrfs to
> report a uniform st_dev, and different (64bit) st_ino in different subvols.
>
> It has problems:
>  - it will break any 32bit readdir and 32bit stat. I don't know how big
>    a problem that is these days (ino_t in the kernel is "unsigned long",
>    not "unsigned long long". That surprised me).
>  - It might break some user-space expectations. One thing I have learnt
>    is not to make any assumption about what other people might expect.
>
> However, it would be quite easy to make this opt-in (or opt-out) with a
> mount option, so that people who need the current inode numbers and will
> accept the current breakage can keep working.
>
> I think this approach would be a net-win for NFS export, whether BTRFS
> supports it directly or not. I might post a patch which modifies NFS to
> intuit improved inode numbers for btrfs exports....
>
> So: how would this break your use-case??

The simple cases are find -xdev and du -x, which expect st_dev to change
at a subvolume boundary, but that can be excused when opting in to a
unified st_dev namespace.

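As a rough illustration of why those tools depend on the st_dev change, here is a minimal Python sketch of an -xdev style walk. It is not how find or du are implemented; the name walk_xdev and the dev_of parameter are hypothetical, and dev_of exists only so the boundary can be simulated without a real btrfs mount.

```python
import os


def walk_xdev(root, dev_of=lambda p: os.lstat(p).st_dev):
    """Yield paths under root without entering directories on another st_dev.

    dev_of is a hypothetical hook returning the device number of a path;
    by default it reads st_dev via lstat().
    """
    root_dev = dev_of(root)
    stack = [root]
    while stack:
        directory = stack.pop()
        yield directory
        with os.scandir(directory) as entries:
            for entry in entries:
                if not entry.is_dir(follow_symlinks=False):
                    yield entry.path
                elif dev_of(entry.path) == root_dev:
                    # Same device: keep descending.
                    stack.append(entry.path)
                else:
                    # Device boundary: report the directory, do not enter it.
                    yield entry.path
```

If every subvolume reports the same st_dev, dev_of returns one value everywhere, the boundary branch never runs, and the walk silently descends into every subvolume and snapshot.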
The harder problem is <st_dev;st_ino> collisions, which are not even that
hard to hit with an unlimited number of snapshots.
The 'diff' tool demonstrates the implications for userspace of
<st_dev;st_ino> collisions between different objects.
See xfstest overlay/049 for a demonstration.

The overlayfs xino feature made a similar change to overlayfs
<st_dev;st_ino>, with one big difference - applications expect that
all objects in an overlayfs mount will have the same st_dev.

Also, overlayfs has prior knowledge of the number of layers,
so it is easier to parcel out the ino namespace and avoid collisions.

Thanks,
Amir.
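To make the point about parcelling out the ino namespace concrete, here is a minimal Python sketch. It is only loosely modelled on the idea behind xino and is not the overlayfs code; pack_ino, INO_BITS and layer_bits are hypothetical names, and the choice to return None on overflow is an assumption made for illustration.

```python
INO_BITS = 64  # assumed width of the reported st_ino


def pack_ino(layer, ino, layer_bits):
    """Combine a layer (or subvolume) index with a per-layer inode number.

    The index goes in the top layer_bits bits and the inode number in the
    remaining low bits. Returns None if either value does not fit, since
    any packed result would then risk colliding with another object.
    """
    low_bits = INO_BITS - layer_bits
    if layer >= (1 << layer_bits) or ino >= (1 << low_bits):
        return None
    return (layer << low_bits) | ino


# The same per-layer inode number in two layers gets distinct values.
assert pack_ino(0, 300, layer_bits=8) != pack_ino(1, 300, layer_bits=8)
# With 8 bits reserved, a 257th layer no longer fits.
assert pack_ino(256, 300, layer_bits=8) is None
```

This works when the number of layers is known up front, because layer_bits can be fixed. With an unbounded number of snapshots there is no safe value to choose: every bit reserved for the index is taken away from the per-subvolume inode range.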