On Sun, 7 Jan 2024 13:42:39 +0100 Christian Brauner <brauner@xxxxxxxxxx> wrote: > > So, I tried to do an exploratory patch even though I promised myself not > to do it. But hey... > > Some notes: > > * Permission handling for idmapped mounts is done completely in the > VFS. That's the case for all filesytems that don't have a custom > ->permission() handler. So there's nothing to do for us here. > > * Idmapped mount handling for ->getattr() is done completely by the VFS > if the filesystem doesn't have a custom ->getattr() handler. So we're > done here. > > * Tracefs doesn't support attribute changes via ->setattr() (chown(), > chmod etc.). So there's nothing to here. > > * Eventfs does support attribute changes via ->setattr(). But it relies > on simple_setattr() which is already idmapped mount aware. So there's > nothing for us to do. > > * Ownership is inherited from the parent inode (tracefs) or optionally > from stashed ownership information (eventfs). That means the idmapping > is irrelevant here. It's similar to the "inherit gid from parent > directory" logic we have in some circumstances. TL;DR nothing to do > here as well. The reason ownership is inherited from the parent is because the inodes are created at boot up before user space starts. eventfs inodes are created on demand after user space so it needs to either check the "default" ownership and permissions, or if the user changed a specific file/directory, it must save it and use that again if the inode/dentry are reclaimed and then referenced again and recreated. > > * Tracefs supports the creation of instances from userspace via mkdir. > For example, > > mkdir /sys/kernel/tracing/instances/foo > > And here the idmapping is relevant so we need to make the helpers > aware of the idmapping. > > I just went and plumbed this through to most helpers. > > There's some subtlety in eventfs. Afaict, the directories and files for > the individual events are created on-demand during lookup or readdir. > > The ownership of these events is again inherited from the parent inode > or recovered from stored state. In both cases the actual idmapping is > irrelevant. > > The callchain here is: > > eventfs_root_lookup("xfs", "events") > -> create_{dir,file}_dentry("xfs", "events") > -> create_{dir,file}("xfs", "events") > -> eventfs_start_creating("xfs", "events") > -> lookup_one_len("xfs", "events") > > And the subtlety is that lookup_one_len() does permission checking on > the parent inode (IOW, if you want a dentry for "blech" under "events" > it'll do a permission check on events->d_inode) for exec permissions > and then goes on to give you a new dentry. > > Usually this call would have to be changed to lookup_one() and the > idmapping be handed down to it. But I think that's irrelevant here. > > Lookup generally doesn't need to be aware of idmappings at all. The > permission checking is done purely in the vfs via may_lookup() and the > idmapping is irrelevant because we always initialize inodes with the > filesystem level ownership (see the idmappings.rst) documentation if > you're interested in excessive details (otherwise you get inode aliases > which you really don't want). > > For tracefs it would not matter for lookup per se but only because > tracefs seemingly creates inodes/dentries during lookup (and readdir()). tracefs creates the inodes/dentries at boot up, it's eventfs that does it on demand during lookup. For inodes/dentries: /sys/kernel/tracing/* is all created at boot up, except for "events". /sys/kernel/tracing/events/* is created on demand. > > But imho the permission checking done in current eventfs_root_lookup() > via lookup_one_len() is meaningless in any way; possibly even > (conceptually) wrong. > > Because, the actual permission checking for the creation of the eventfs > entries isn't really done during lookup or readdir, it's done when mkdir > is called: > > mkdir /sys/kernel/tracing/instances/foo No. that creates a entire new tracefs instance, which happens to include another eventfs directory. The eventsfs directory is in "/sys/kernel/tracing/events" and "/sys/kernel/tracing/instances/*/events" eventfs has 10s of thousands of files and directories which is why I changed it to be on-demand inode/dentry creation and also reclaiming when no longer accessed. # ls /sys/kernel/tracing/events/ will create the inodes and dentries, and a memory stress program will free those created inodes and dentries. > > Here, all possible entries beneath foo including "events" and further > below are recorded and stored. So once mkdir returns it basically means > that it succeeded with the creation of all the necessary directories and > files. For all purposes the foo/events/ directory and below have all the > entries that matter. They have been created. It's comparable to them not > being in the {d,i}cache, I guess. No. Only the meta data is created for the eventfs directory with a mkdir instances/foo. The inodes and dentries are not there. > > When one goes and looksup stuff under foo/events/ or readdir the entries > in that directory: > > fd = open("foo/events") > readdir(fd, ...) > > then they are licensed to list an entry in that directory. So all that > needs to be done is to actually list those files in that directory. And > since they already exist (they were created during mkdir) we just need > to splice in inodes and dentries for them. But for that we shouldn't > check permissions on the directory again. Because we've done that > already correctly when the VFS called may_lookup(). No they do not exist. > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > redundant and just wrong. I don't think so. > > Luckily, I don't think we need to even change anything because all > directories that eventfs creates always grant exec permissions to the > other group so lookup_one_len() will trivially succeed. IIUC. > > Drafted-by-with-no-guarantees-whatsoever-that-this-wont-burn-the-house-down: Christian Brauner <brauner@xxxxxxxxxx> Checkout this branch: git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git eventfs-show-files It creates a file "/sys/kernel/tracing/show_events_dentries" That shows when inodes and dentries are created and freed. I use this to make sure that reclaim actually does reclaim them. Linus did not like this file, but it has become very useful to make sure things are working properly as there is no other way to know if the inodes and dentries are reclaimed or not. I may see if he'll let me add it with a debug config option. -- Steve