On Wed, Jul 12, 2023 at 1:04 AM Ivan Babrou <ivan@xxxxxxxxxxxxxx> wrote: > > On Tue, Jul 11, 2023 at 2:49 AM Amir Goldstein <amir73il@xxxxxxxxx> wrote: > > > > On Tue, Jul 11, 2023 at 12:21 AM Ivan Babrou <ivan@xxxxxxxxxxxxxx> wrote: > > > > > > On Mon, Jul 10, 2023 at 12:40 PM Greg Kroah-Hartman > > > <gregkh@xxxxxxxxxxxxxxxxxxx> wrote: > > > > > > > > On Mon, Jul 10, 2023 at 11:33:38AM -0700, Ivan Babrou wrote: > > > > > The following two commits added the same thing for tmpfs: > > > > > > > > > > * commit 2b4db79618ad ("tmpfs: generate random sb->s_uuid") > > > > > * commit 59cda49ecf6c ("shmem: allow reporting fanotify events with file handles on tmpfs") > > > > > > > > > > Having fsid allows using fanotify, which is especially handy for cgroups, > > > > > where one might be interested in knowing when they are created or removed. > > > > > > > > > > Signed-off-by: Ivan Babrou <ivan@xxxxxxxxxxxxxx> > > > > > --- > > > > > fs/kernfs/mount.c | 13 ++++++++++++- > > > > > 1 file changed, 12 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c > > > > > index d49606accb07..930026842359 100644 > > > > > --- a/fs/kernfs/mount.c > > > > > +++ b/fs/kernfs/mount.c > > > > > @@ -16,6 +16,8 @@ > > > > > #include <linux/namei.h> > > > > > #include <linux/seq_file.h> > > > > > #include <linux/exportfs.h> > > > > > +#include <linux/uuid.h> > > > > > +#include <linux/statfs.h> > > > > > > > > > > #include "kernfs-internal.h" > > > > > > > > > > @@ -45,8 +47,15 @@ static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry) > > > > > return 0; > > > > > } > > > > > > > > > > +int kernfs_statfs(struct dentry *dentry, struct kstatfs *buf) > > > > > +{ > > > > > + simple_statfs(dentry, buf); > > > > > + buf->f_fsid = uuid_to_fsid(dentry->d_sb->s_uuid.b); > > > > > + return 0; > > > > > +} > > > > > + > > > > > const struct super_operations kernfs_sops = { > > > > > - .statfs = simple_statfs, > > > > > + .statfs = kernfs_statfs, > > > > > .drop_inode = generic_delete_inode, > > > > > .evict_inode = kernfs_evict_inode, > > > > > > > > > > @@ -351,6 +360,8 @@ int kernfs_get_tree(struct fs_context *fc) > > > > > } > > > > > sb->s_flags |= SB_ACTIVE; > > > > > > > > > > + uuid_gen(&sb->s_uuid); > > > > > > > > Since kernfs has as lot of nodes (like hundreds of thousands if not more > > > > at times, being created at boot time), did you just slow down creating > > > > them all, and increase the memory usage in a measurable way? > > > > > > This is just for the superblock, not every inode. The memory increase > > > is one UUID per kernfs instance (there are maybe 10 of them on a basic > > > system), which is trivial. Same goes for CPU usage. > > > > > > > We were trying to slim things down, what userspace tools need this > > > > change? Who is going to use it, and what for? > > > > > > The one concrete thing is ebpf_exporter: > > > > > > * https://github.com/cloudflare/ebpf_exporter > > > > > > I want to monitor cgroup changes, so that I can have an up to date map > > > of inode -> cgroup path, so that I can resolve the value returned from > > > bpf_get_current_cgroup_id() into something that a human can easily > > > grasp (think system.slice/nginx.service). Currently I do a full sweep > > > to build a map, which doesn't work if a cgroup is short lived, as it > > > just disappears before I can resolve it. Unfortunately, systemd > > > recycles cgroups on restart, changing inode number, so this is a very > > > real issue. > > > > > > There's also this old wiki page from systemd: > > > > > > * https://freedesktop.org/wiki/Software/systemd/Optimizations > > > > > > Quoting from there: > > > > > > > Get rid of systemd-cgroups-agent. Currently, whenever a systemd cgroup runs empty a tool "systemd-cgroups-agent" is invoked by the kernel which then notifies systemd about it. The need for this tool should really go away, which will save a number of forked processes at boot, and should make things faster (especially shutdown). This requires introduction of a new kernel interface to get notifications for cgroups running empty, for example via fanotify() on cgroupfs. > > > > > > So a similar need to mine, but for different systemd-related needs. > > > > > > Initially I tried adding this for cgroup fs only, but the problem felt > > > very generic, so I pivoted to having it in kernfs instead, so that any > > > kernfs based filesystem would benefit. > > > > > > Given pretty much non-existing overhead and simplicity of this, I > > > think it's a change worth doing, unless there's a good reason to not > > > do it. I cc'd plenty of people to make sure it's not a bad decision. > > > > > > > I agree. I think it was a good decision. > > I have some followup questions though. > > > > I guess your use case cares about the creation of cgroups? > > as long as the only way to create a cgroup is via vfs > > vfs_mkdir() -> ... cgroup_mkdir() > > fsnotify_mkdir() will be called. > > Is that a correct statement? > > As far as I'm aware, this is the only way. We have the cgroups mailing > list CC'd to confirm. > > I checked systemd and docker as real world consumers and both use > mkdir and are visible in fanotify with this patch applied. > > > Because if not, then explicit fsnotify_mkdir() calls may be needed > > similar to tracefs/debugfs. > > > > I don't think that the statement holds for dieing cgroups, > > so explicit fsnotify_rmdir() are almost certainly needed to make > > inotify/fanotify monitoring on cgroups complete. > > > > I am on the fence w.r.t making the above a prerequisite to merging > > your patch. > > > > One the one hand, inotify monitoring of cgroups directory was already > > possible (I think?) with the mentioned shortcomings for a long time. > > > > On the other hand, we have an opportunity to add support to fanotify > > monitoring of cgroups directory only after the missing fsnotify hooks > > are added, making fanotify API a much more reliable option for > > monitoring cgroups. > > > > So I am leaning towards requiring the missing fsnotify hooks before > > attaching a unique fsid to cgroups/kernfs. > > Unless somebody responsible for cgroups says there's a different way > to create cgroups, I think this requirement doesn't apply. > I was more concerned about the reliability of FAN_DELETE for dieing cgroups without an explicit rmdir() from userspace. > > In any case, either with or without the missing hooks, I would not > > want this patch merged until Jan had a chance to look at the > > implications and weigh in on the missing hooks question. > > Jan is on vacation for three weeks, so in the meanwhile, feel free > > to implement and test the missing hooks or wait for his judgement. > > Sure, I can definitely wait. > > > On an unrelated side topic, > > I would like to point your attention to this comment in the patch that > > was just merged to v6.5-rc1: > > > > 69562eb0bd3e ("fanotify: disallow mount/sb marks on kernel internal pseudo fs") > > > > /* > > * mount and sb marks are not allowed on kernel internal pseudo fs, > > * like pipe_mnt, because that would subscribe to events on all the > > * anonynous pipes in the system. > > * > > * SB_NOUSER covers all of the internal pseudo fs whose objects are not > > * exposed to user's mount namespace, but there are other SB_KERNMOUNT > > * fs, like nsfs, debugfs, for which the value of allowing sb and mount > > * mark is questionable. For now we leave them alone. > > */ > > > > My question to you, as the only user I know of for fanotify FAN_REPORT_FID > > on SB_KERNMOUNT, do you have plans to use a mount or filesystem mark > > to monitor cgroups? or only inotify-like directory watches? > > My plan is to use FAN_MARK_FILESYSTEM for the whole cgroup mount, > since that is what I was able to make work with my limited > understanding of the whole fanotify thing. I started with > fanotify_fid.c example from here: > > * https://man7.org/linux/man-pages/man7/fanotify.7.html > > My existing code does the mark this way (on v6.5-rc1 with my patch applied): > > ret = fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_ONLYDIR | > FAN_MARK_FILESYSTEM, FAN_CREATE | FAN_DELETE | FAN_ONDIR, AT_FDCWD, > argv[1]); > > My goal is to set a watch for all cgroups and drop capabilities, so > that I can keep monitoring for events while being unprivileged. As long as you do not need to open_by_handle_at() or as long as you retain CAP_DAC_READ_SEARCH. > As far > as I'm aware, this sort of recursive monitoring without races isn't > possible with inode level monitoring (I might be wrong here). You are not wrong. > > I do get -EINVAL for FAN_MARK_MOUNT instead of FAN_MARK_FILESYSTEM. Right. FAN_CREATE | FAN_DELETE not supported with FAN_MARK_MOUNT. Thanks, Amir.