On 2023/11/6 18:23, Christoph Hellwig wrote:
On Fri, Nov 03, 2023 at 04:47:02PM +0100, Christian Brauner wrote:
I think the idea of using vfsmounts for this makes some sense if the
goal is to retroactively justify and accommodate the idea that a
subvolume is to be treated as equivalent to a separate device.
Only historically has st_dev been about treating something as
a device. For userspace the most important part is that it designates
a separate domain for inode numbers. And that's something that's simply
broken in btrfs.
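To illustrate the inode-number-domain point, here is a minimal Python sketch of how userspace commonly builds a file identity from the (st_dev, st_ino) pair. The helper names file_identity and same_file are hypothetical, made up for this illustration.

```python
import os

def file_identity(path):
    """Return the (st_dev, st_ino) pair commonly used as a file's identity."""
    st = os.lstat(path)  # lstat: do not follow a trailing symlink
    return (st.st_dev, st.st_ino)

def same_file(a, b):
    """True if both paths resolve to the same (st_dev, st_ino) pair."""
    return file_identity(a) == file_identity(b)
```

An inode number is only meaningful within one st_dev, so if two subvolumes reported the same st_dev, files with equal inode numbers in each would compare as identical here.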
In fact, I'm not sure the "treating something as a device" idea was
even correct long before btrfs.
For example, take an ext4 fs with an external log device. Thankfully it's
still more or less obvious that we would use the device number of the main
fs, not the log device, but such examples already existed.
Another thing is, the st_dev situation has to be kept, as there are too
many legacy programs that rely on it to distinguish btrfs subvolume
boundaries. Unfortunately this can never be changed, even if we had
some better solution (like the proposed extra subvolid through statx).
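As a rough sketch of the kind of legacy behavior I mean (similar in spirit to "find -xdev" or "du -x"), a tree walker can refuse to descend into any directory whose st_dev differs from the starting point. The function name walk_same_dev is hypothetical, and this is only an illustration, not what any particular tool does internally.

```python
import os

def walk_same_dev(root):
    """Yield directories under root, never crossing an st_dev boundary."""
    root_dev = os.lstat(root).st_dev
    stack = [root]
    while stack:
        current = stack.pop()
        yield current
        with os.scandir(current) as entries:
            for entry in entries:
                if not entry.is_dir(follow_symlinks=False):
                    continue
                # A different st_dev marks a boundary (a mount point or,
                # on btrfs, a subvolume), so do not descend into it.
                if entry.stat(follow_symlinks=False).st_dev == root_dev:
                    stack.append(entry.path)
```

With a per-subvolume st_dev, such a walker stops at every subvolume; if subvolumes shared one st_dev, it would silently walk into them.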
I question that premise though. I think marking them with separate
device numbers is bringing us nothing but pain at this point and this
solution is basically bending the vfs to make that work somehow.
Well, the only other theoretical option would be to use a simple
inode number space across subvolumes in btrfs, but I don't really
see how that could be retrofitted in any sensible way.
I would feel much more comfortable if the two filesystems that expose
these objects give us something like STATX_SUBVOLUME that userspace can
raise in the request mask of statx().
Except that this doesn't fix any existing code.
To me, the biggest btrfs specific problem is the number of btrfs
subvolumes vs the very limited size of the anonymous device number pool.
As long as we don't expand the st_dev width, nor change the behavior of
the per-subvolume st_dev number, the only thing I can come up with is
allowing manually "unmounting" a subvolume to reclaim the anonymous
device number.
I believe the per-subvolume vfsmount and the automount behavior
for subvolumes can help a lot with that.
If userspace requests STATX_SUBVOLUME in the request mask, the two
filesystems raise STATX_SUBVOLUME in the statx result mask and then also
return the _real_ device number of the superblock and stop exposing that
made up device number.
Btrfs uses the anonymous device number pool because we don't have any
better way to return a "real" device number.
There may be one device or any number of devices, versus far more
subvolumes.
Thus we took the "natural" approach of using the anonymous device number
pool, but as we can all see already, the pool is not large enough for
subvolumes.
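For a sense of scale, here is a small Python sketch that checks whether an st_dev value comes from the anonymous pool and gives an upper bound on the pool size. It assumes that anonymous device numbers use major 0 and that the kernel's internal minor field is 20 bits wide; both constants and the helper names are assumptions written down for illustration, not values taken from this thread.

```python
import os

ANON_MAJOR = 0          # assumption: anonymous device numbers use major 0
KERNEL_MINOR_BITS = 20  # assumption: minor field width in the kernel's dev_t

def is_anonymous_dev(st_dev):
    """True if st_dev looks like it was allocated from the anonymous pool."""
    return os.major(st_dev) == ANON_MAJOR

def anon_pool_upper_bound():
    """Upper bound on how many anonymous device numbers can exist at once."""
    return 1 << KERNEL_MINOR_BITS
```

Under those assumptions the pool tops out at roughly a million entries, shared with every other filesystem that needs an anonymous device number.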
What is a "real" device number?
I'm more interested in whether we can allocate st_dev from other pools.
IIRC logical volumes (LVs from LVM) do not allocate from the anonymous dev
number pool, so this may sound like a stupid question, but what's
preventing us from using the device number pool of LVM?
Device number conflicts or something else?
Thanks,
Qu