Re: [PATCH 0/3] fanotify support for btrfs sub-volumes

On Mon 06-11-23 10:52:24, Christian Brauner wrote:
> On Mon, Nov 06, 2023 at 10:03:55AM +0100, Jan Kara wrote:
> > On Fri 03-11-23 16:47:02, Christian Brauner wrote:
> > > On Fri, Nov 03, 2023 at 07:28:42AM -0700, Christoph Hellwig wrote:
> > > > On Thu, Nov 02, 2023 at 12:07:47PM +0100, Christian Brauner wrote:
> > > > > But at that point we really need to ask if it makes sense to use
> > > > > vfsmounts per subvolume in the first place:
> > > > > 
> > > > > (1) We pollute /proc/<pid>/mountinfo with a lot of mounts.
> > > > > (2) By calling ->getattr() from show_mountinfo() we open the whole
> > > > >     system up to deadlocks.
> > > > > (3) We change btrfs semantics drastically to the point where they need a
> > > > >     new mount, module, or Kconfig option.
> > > > > (4) We make (initial) lookup on btrfs subvolumes more heavyweight
> > > > >     because you need to create a mount for the subvolume.
> > > > > 
> > > > > So right now, I don't see how we can make this work even if the concept
> > > > > doesn't seem necessarily wrong.
> > > > 
> > > > How else do you want to solve it?  Crossing a mount point is the
> > > > only legitimate boundary for changing st_dev and having a new inode
> > > > number space.  And we can't fix that retroactively.
> > > 
> > > I think the idea of using vfsmounts for this makes some sense if the
> > > goal is to retroactively justify and accommodate the idea that a
> > > subvolume is to be treated as equivalent to a separate device.
> > > 
> > > I question that premise though. I think marking them with separate
> > > device numbers is bringing us nothing but pain at this point and this
> > > solution is basically bending the vfs to make that work somehow.
> > > 
> > > And the worst thing is that I think that treating subvolumes like
> > > vfsmounts will hurt vfsmounts more than it will hurt subvolumes.
> > > 
> > > Right now all that vfsmounts technically are is a topological
> > > abstraction on top of filesystem objects such as files, directories,
> > > sockets, even devices that are exposed as filesystems objects. None of
> > > them get to muck with core properties of what a vfsmount is though.
> > > 
> > > Additionally, vfsmounts are tied to a superblock and share the device
> > > numbers with the superblock they belong to.
> > > 
> > > If we make subvolumes and vfsmounts equivalent we break both properties.
> > > And I think that's wrong or at least really ugly.
> > > 
> > > And I already see that the suggested workaround for (2) will somehow end
> > > up being stashing device numbers in struct mount or struct vfsmount so
> > > we can show it in mountinfo and if that's the case I want to express a
> > > proactive nak for that solution.
> > > 
> > > The way I see it, a subvolume is, in the end, nothing but a
> > > subclass of directories, a special one, but whatever.
> > 
> > As far as I understand the problem, subvolumes indeed seem closer to
> > special directories than anything else. They slightly resemble what ext4 &
> > xfs implement with project quotas (where each inode can have an additional
> > recursively inherited "project id"). What breaks this "special directory"
> > kind of view for btrfs is that subvolumes have overlapping inode numbers.
> > Since we don't seem to have a way of getting out of the current situation
> > in a "seamless" way anyway, I wonder if implementing a btrfs feature to
> > provide unique inode numbers across all subvolumes would not be the
> > cleanest way out...
> > 
> > > I would feel much more comfortable if the two filesystems that expose
> > > these objects give us something like STATX_SUBVOLUME that userspace can
> > > raise in the request mask of statx().
> > > 
> > > If userspace requests STATX_SUBVOLUME in the request mask, the two
> > > filesystems raise STATX_SUBVOLUME in the statx result mask and then also
> > > return the _real_ device number of the superblock and stop exposing that
> > > made up device number.
> > > 
> > > This can be accompanied by a vfs ioctl that is identical for both btrfs
> > > and bcachefs and returns $whatever unique property to mark the inode
> > > space of the subvolume.
> > > 
> > > And then we leave innocent vfs objects alone and we also avoid
> > > bringing in all that heavy vfsmount machinery on top of subvolumes.
> > 
> > Well, but this requires application knowledge of a new type of object - a
> > subvolume. So you'd have to teach all applications that try to identify
> > whether two "filenames" point to the same object or not about this and that
> > seems like a neverending story. Hence either we will live with fake devices
> 
> But that is what's happening today already, no? All tools need to figure
> out that they are on a btrfs subvolume somehow whenever they want to do
> something meaningful to it. systemd code is full of special btrfs
> handling code.

Yes, for systemd, util-linux or similar tools, there's probably no way they
can avoid knowing about btrfs. If your API makes life easier for them, sure
we can do it. But I was speaking more about tools like diff or tar which
want to find out if two paths lead to the same object (inode) or not. For
such tools I'd hope we can avoid introducing the special subvolume
awareness...

> I don't understand why we're bending and breaking ourselves to somehow
> make a filesystem-specific, special object fit into standard APIs when
> it clearly breaks standard apis?

Firstly, I'm not hung up on any particular solution (or even keeping status
quo). I was under the impression (maybe wrong) that Christoph would like to
eventually get rid of reporting different st_dev in stat(2) for different
subvolumes and different fsids in statfs(2) as well. So I was thinking
about possibilities for that.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR



