On Thu, Apr 08, 2021 at 10:19:35AM +0200, Christian Brauner wrote: > On Wed, Apr 07, 2021 at 02:46:11PM -0700, Daniel Xu wrote: > > This commit introduces the bpf page cache iterator. This iterator allows > > users to run a bpf prog against each page in the "page cache". > > Internally, the "page cache" is extremely tied to VFS superblock + inode > > combo. Because of this, iter_pagecache will only examine pages in the > > caller's mount namespace. > > > > Signed-off-by: Daniel Xu <dxu@xxxxxxxxx> > > --- > > kernel/bpf/Makefile | 2 +- > > kernel/bpf/pagecache_iter.c | 293 ++++++++++++++++++++++++++++++++++++ > > 2 files changed, 294 insertions(+), 1 deletion(-) > > create mode 100644 kernel/bpf/pagecache_iter.c <...> > > > > +static int init_seq_pagecache(void *priv_data, struct bpf_iter_aux_info *aux) > > +{ > > + struct bpf_iter_seq_pagecache_info *info = priv_data; > > + struct radix_tree_iter iter; > > + struct super_block *sb; > > + struct mount *mnt; > > + void **slot; > > + int err; > > + > > + info->ns = current->nsproxy->mnt_ns; > > + get_mnt_ns(info->ns); > > + INIT_RADIX_TREE(&info->superblocks, GFP_KERNEL); > > + > > + spin_lock(&info->ns->ns_lock); > > + list_for_each_entry(mnt, &info->ns->list, mnt_list) { > > Not just are there helpers for taking ns_lock > static inline void lock_ns_list(struct mnt_namespace *ns) > static inline void unlock_ns_list(struct mnt_namespace *ns) > they are private to fs/namespace.c because it's the only place that > should ever walk this list. Thanks for the hints. Would it be acceptable to add some helpers to fs/namespace.c to allow walking the list? IIUC the only way to find a list of mounts is by looking at the mount namespace. And walking each mount and looking at each `struct super_node`'s inode's `struct address_space` seemed like the cleanest way to walkthe page cache. > This seems buggy: why is it ok here to only take ns_lock and not also > namespace_sem like mnt_already_visible() and __is_local_mountpoint() > or the relevant proc iterators? I might be missing something. Thanks for the hints. I'll take a closer look at the locking. Most probably I didn't get it right. I should have also mentioned in the cover letter that I'm fairly sure I messed up the locking somewhere. > > > + sb = mnt->mnt.mnt_sb; > > + > > + /* The same mount may be mounted in multiple places */ > > + if (radix_tree_lookup(&info->superblocks, (unsigned long)sb)) > > + continue; > > + > > + err = radix_tree_insert(&info->superblocks, > > + (unsigned long)sb, (void *)1); > > + if (err) > > + goto out; > > + } > > + > > + radix_tree_for_each_slot(slot, &info->superblocks, &iter, 0) { > > + sb = (struct super_block *)iter.index; > > + atomic_inc(&sb->s_active); > > It also isn't nice that you mess with sb->s_active directly. > > Imho, this is poking around in a lot of fs/ specific stuff that other > parts of the kernel should not care about or have access to. Re above: do you think it'd be appropriate to add more helpers to fs/ ? <...> Thanks, Daniel