Re: [PATCH] kernfs: attach uuid for every kernfs and report it in fsid

Ivan Babrou <ivan@xxxxxxxxxxxxxx> · Mon, 10 Jul 2023 14:21:10 -0700

On Mon, Jul 10, 2023 at 12:40 PM Greg Kroah-Hartman
<gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, Jul 10, 2023 at 11:33:38AM -0700, Ivan Babrou wrote:
> > The following two commits added the same thing for tmpfs:
> >
> > * commit 2b4db79618ad ("tmpfs: generate random sb->s_uuid")
> > * commit 59cda49ecf6c ("shmem: allow reporting fanotify events with file handles on tmpfs")
> >
> > Having fsid allows using fanotify, which is especially handy for cgroups,
> > where one might be interested in knowing when they are created or removed.
> >
> > Signed-off-by: Ivan Babrou <ivan@xxxxxxxxxxxxxx>
> > ---
> >  fs/kernfs/mount.c | 13 ++++++++++++-
> >  1 file changed, 12 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> > index d49606accb07..930026842359 100644
> > --- a/fs/kernfs/mount.c
> > +++ b/fs/kernfs/mount.c
> > @@ -16,6 +16,8 @@
> >  #include <linux/namei.h>
> >  #include <linux/seq_file.h>
> >  #include <linux/exportfs.h>
> > +#include <linux/uuid.h>
> > +#include <linux/statfs.h>
> >
> >  #include "kernfs-internal.h"
> >
> > @@ -45,8 +47,15 @@ static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry)
> >       return 0;
> >  }
> >
> > +static int kernfs_statfs(struct dentry *dentry, struct kstatfs *buf)
> > +{
> > +     simple_statfs(dentry, buf);
> > +     buf->f_fsid = uuid_to_fsid(dentry->d_sb->s_uuid.b);
> > +     return 0;
> > +}
> > +
> >  const struct super_operations kernfs_sops = {
> > -     .statfs         = simple_statfs,
> > +     .statfs         = kernfs_statfs,
> >       .drop_inode     = generic_delete_inode,
> >       .evict_inode    = kernfs_evict_inode,
> >
> > @@ -351,6 +360,8 @@ int kernfs_get_tree(struct fs_context *fc)
> >               }
> >               sb->s_flags |= SB_ACTIVE;
> >
> > +             uuid_gen(&sb->s_uuid);
>
> Since kernfs has a lot of nodes (like hundreds of thousands if not more
> at times, being created at boot time), did you just slow down creating
> them all, and increase the memory usage in a measurable way?

The UUID is generated once per superblock in kernfs_get_tree(), not
once per inode. The memory increase is one UUID per kernfs instance
(there are maybe 10 of them on a basic system), which is trivial. The
CPU cost is likewise trivial.
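
To make the per-superblock cost concrete: the only state involved is one
16-byte UUID, which statfs folds into the 64-bit fsid. Below is a rough
userspace sketch of that fold, on the assumption that it mirrors what
uuid_to_fsid() does in the kernel (XOR of the two little-endian 64-bit
halves, then split into two 32-bit words). The names fsid_sketch,
load_le64 and fold_uuid_to_fsid are made up for illustration and are
not kernel symbols.

```c
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's __kernel_fsid_t. */
struct fsid_sketch {
	uint32_t val[2];
};

/* Read 8 bytes as a little-endian 64-bit value, whatever the host order. */
static uint64_t load_le64(const uint8_t *p)
{
	uint64_t v = 0;
	int i;

	for (i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Fold a 16-byte UUID into a 64-bit fsid: XOR both halves, then split. */
static struct fsid_sketch fold_uuid_to_fsid(const uint8_t uuid[16])
{
	uint64_t v = load_le64(uuid) ^ load_le64(uuid + 8);
	struct fsid_sketch fsid = { { (uint32_t)v, (uint32_t)(v >> 32) } };

	return fsid;
}
```

The fold runs only when statfs is called, so nothing here touches the
per-node creation path.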

> We were trying to slim things down, what userspace tools need this
> change?  Who is going to use it, and what for?

The one concrete user is ebpf_exporter:

* https://github.com/cloudflare/ebpf_exporter

I want to monitor cgroup changes so that I can keep an up-to-date map
of inode -> cgroup path and resolve the value returned from
bpf_get_current_cgroup_id() into something that a human can easily
grasp (think system.slice/nginx.service). Currently I do a full sweep
to build the map, which doesn't work for short-lived cgroups: they
disappear before I can resolve them. Unfortunately, systemd recycles
cgroups on restart, which changes the inode number, so this is a very
real issue.
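
A rough sketch of the kind of watcher this would enable, for
illustration only. It is not ebpf_exporter's actual code, and the
function names are made up. It assumes a kernel carrying this patch
(without a non-zero fsid, fanotify_mark() is expected to fail for groups
that report file handles), headers new enough to define
FAN_REPORT_DFID_NAME, and CAP_SYS_ADMIN for the filesystem-wide mark.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/fanotify.h>
#include <unistd.h>

/* Map an event mask to a label; only directory create/delete matters. */
static const char *cgroup_event_kind(uint64_t mask)
{
	if (!(mask & FAN_ONDIR))
		return "ignored";	/* cgroups are directories */
	if (mask & FAN_CREATE)
		return "created";
	if (mask & FAN_DELETE)
		return "removed";
	return "ignored";
}

/* Watch a whole filesystem for mkdir/rmdir. Returns an fd or -1 (errno set). */
int cgroup_watch_open(const char *mountpoint)
{
	int fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_DFID_NAME, O_RDONLY);
	int saved;

	if (fd < 0)
		return -1;
	if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
			  FAN_CREATE | FAN_DELETE | FAN_ONDIR,
			  AT_FDCWD, mountpoint) < 0) {
		saved = errno;	/* keep the fanotify_mark() error across close() */
		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}

/* Read one batch of events and print the name of each affected directory. */
int cgroup_watch_drain(int fd)
{
	char buf[8192]
		__attribute__((aligned(__alignof__(struct fanotify_event_metadata))));
	struct fanotify_event_metadata *ev;
	ssize_t len = read(fd, buf, sizeof(buf));

	if (len < 0)
		return -1;
	for (ev = (void *)buf; FAN_EVENT_OK(ev, len); ev = FAN_EVENT_NEXT(ev, len)) {
		struct fanotify_event_info_fid *fid;
		struct file_handle *fh;

		if (ev->event_len <= ev->metadata_len)
			continue;	/* no info record attached */
		fid = (void *)((char *)ev + ev->metadata_len);
		if (fid->hdr.info_type != FAN_EVENT_INFO_TYPE_DFID_NAME)
			continue;
		/* The entry name follows the parent directory's file handle. */
		fh = (void *)fid->handle;
		printf("%s: %s\n", cgroup_event_kind(ev->mask),
		       (const char *)fh->f_handle + fh->handle_bytes);
	}
	return 0;
}
```

A caller would do fd = cgroup_watch_open("/sys/fs/cgroup") and then call
cgroup_watch_drain(fd) in a loop. Resolving the parent handle to a full
path (for example with open_by_handle_at()) is left out of the sketch.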

There's also this old wiki page from systemd:

* https://freedesktop.org/wiki/Software/systemd/Optimizations

Quoting from there:

> Get rid of systemd-cgroups-agent. Currently, whenever a systemd cgroup runs empty a tool "systemd-cgroups-agent" is invoked by the kernel which then notifies systemd about it. The need for this tool should really go away, which will save a number of forked processes at boot, and should make things faster (especially shutdown). This requires introduction of a new kernel interface to get notifications for cgroups running empty, for example via fanotify() on cgroupfs.

So it's a need similar to mine, but for a different, systemd-related
purpose.

Initially I tried adding this for cgroup fs only, but the problem felt
very generic, so I pivoted to having it in kernfs instead, so that any
kernfs-based filesystem would benefit.
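
One way to see the effect from userspace on any kernfs-based mount is
to check whether statfs() reports a non-zero fsid. A minimal sketch,
with path_has_fsid being a made-up helper name:

```c
#include <string.h>
#include <sys/vfs.h>

/*
 * Return 1 if the filesystem at path reports a non-zero fsid, 0 if the
 * fsid is all zeroes, and -1 if statfs() fails (errno is set).
 */
int path_has_fsid(const char *path)
{
	struct statfs st;
	unsigned char zero[sizeof(st.f_fsid)] = {0};

	if (statfs(path, &st) < 0)
		return -1;
	return memcmp(&st.f_fsid, zero, sizeof(zero)) != 0;
}
```

With the patch applied, the expectation is that path_has_fsid("/sys")
and path_has_fsid("/sys/fs/cgroup") both return 1; on an unpatched
kernel they would presumably return 0.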

Given the practically nonexistent overhead and the simplicity of this,
I think it's a change worth making, unless there's a good reason not
to. I cc'd plenty of people to make sure it's not a bad decision.

> There were some benchmarks people were doing with booting large memory
> systems that you might want to reproduce here to verify that nothing is
> going to be harmed.

Skipping this, given that the overhead is per superblock and trivial.