Re: [PATCH 1/2] kernfs: add kernfs_ops.free operation to free resources tied to the file

Christian Brauner <brauner@xxxxxxxxxx> · Wed, 28 Jun 2023 10:41:29 +0200

On Wed, Jun 28, 2023 at 12:46:43AM -0700, Suren Baghdasaryan wrote:
> On Wed, Jun 28, 2023 at 12:26 AM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> >
> > On Tue, Jun 27, 2023 at 08:09:46PM -0700, Suren Baghdasaryan wrote:
> > > On Tue, Jun 27, 2023 at 6:54 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
> > > >
> > > > Hello,
> > > >
> > > > On Tue, Jun 27, 2023 at 02:58:08PM -0700, Suren Baghdasaryan wrote:
> > > > > Ok in kernfs_generic_poll() we are using kernfs_open_node.poll
> > > > > waitqueue head for polling and kernfs_open_node is freed from inside
> > > > > kernfs_unlink_open_file() which is called from kernfs_fop_release().
> > > > > So, it is destroyed only when the last fput() is done, unlike the
> > > > > ops->release() operation which we are using for destroying PSI
> > > > > trigger's waitqueue. So, it seems we still need an operation which
> > > > > would indicate that the file is truly going away.
> > > >
> > > > If we want to stay consistent with how kernfs behaves w.r.t. severing, the
> > > > right thing to do would be preventing any future polling at severing and
> > > > waking up everyone currently waiting, which sounds fine from cgroup behavior
> > > > POV too.
> > >
> > > That's actually what we are currently doing for PSI triggers.
> > > ->release() is handled by cgroup_pressure_release() which signals the
> > > waiters, waits for RCU grace period to pass (per
> > > https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L258)
> > > and then releases all the trigger resources including the waitqueue
> > > head. However as reported in
> > > https://lore.kernel.org/all/20230613062306.101831-1-lujialin4@xxxxxxxxxx
> > > this does not save us from the synchronous polling case:
> > >
> > >                                                   do_select
> > >                                                       vfs_poll
> > > cgroup_pressure_release
> > >     psi_trigger_destroy
> > >         wake_up_pollfree(&t->event_wait) -> unblocks vfs_poll
> > >         synchronize_rcu()
> > >         kfree(t) -> frees waitqueue head
> > >                                                      poll_freewait()
> > > -> uses waitqueue head
> > >
> > >
> > > This happens because we release the resources associated with the file
> > > while there are still file users (the file's refcount is non-zero).
> > > And that happens because kernfs can call ->release() before the last
> > > fput().
> > >
> > > >
> > > > Now, the challenge is designing an interface which is difficult to make
> > > > mistake with. IOW, it'd be great if kernfs wraps poll call so that severing
> > > > is implemented without kernfs users doing anything, or at least make it
> > > > pretty obvious what the correct usage pattern is.
> > > >
> > > > > Christian's suggestion to rename current ops->release() operation into
> > > > > ops->drain() (or ops->flush() per Matthew's request) and introduce a
> > > > > "new" ops->release() which is called only when the last fput() is done
> > > > > seems sane to me. Would everyone be happy with that approach?
> > > >
> > > > I'm not sure I'd go there. The contract is that once ->release() is called,
> > > > the code backing that file can go away (e.g. rmmod'd). It really should
> > > > behave just like the last put from kernfs users' POV.
> > >
> > > I 100% agree with the above statement.
> > >
> > > > For this specific fix,
> > > > it's safe because we know the ops is always built into the kernel and won't

I don't know if this talks about kernfs_ops (likely) or talks about
f_ops but fyi for f_ops this isn't a problem. See fops_get() in
do_dentry_open() which takes a reference on the mode that provides the
fops. And debugfs - a little more elaborately - handles this as well.

> > > > go away but it'd be really bad if the interface says "this is a normal thing
> > > > to do". We'd be calling into rmmod'd text pages in no time.
> > > >
> > > > So, I mean, even for temporary fix, we have to make it abundantly clear that
> > > > this is not for usual usage and can only be used if the code backing the ops
> > > > is built into the kernel and so on.
> > >
> > > I think the root cause of this problem is that ->release() in kernfs
> > > does not adhere to the common rule that ->release() is called only
> > > when the file is going away and has no users left. Am I wrong?
> >
> > So imho, ultimately this all comes down to rmdir() having special
> > semantics in kernfs. On any regular filesystem an rmdir() on a directory
> > which is still referenced by a struct file doesn't trigger an
> > f_op->release() operation. It's just that directory is unlinked and
> > you get some sort of errno like ENOENT when you try to create new files
> > in there or whatever. The actual f_op->release) however is triggered
> > on last fput().
> >
> > But in essence, kernfs treats an rmdir() operation as being equivalent
> > to a final fput() such that it somehow magically kills all file
> > references. And that's just wrong and not supported.
> 
> Thanks for the explanation, Christian!
> If kernfs is special and needs different rules for calling
> f_op->release() then fine, but I need an operation which tells me
> there are no users of the file so that I can free the resources.
> What's the best way to do that?

Imho, if there's still someone with an fd referencing that file then
there's still a user. That's unlink() while holding an fd in a nutshell.

But generically, afaui what you seem to want is:

(1) a way to shutdown functionality provided by a kernfs node on removal
    of that node
(2) a way to release the resources of a kernfs node

So (2) is seemingly what kernfs_ops->release() is about but it's also
used for (1). So while I initially thought about ->drain() or whatever
it seems what you really want is for struct kernfs_ops to gain an
unlink()/remove()/rmdir() method(s). The method can be implemented by
interested callers and should be called when the kernfs node is removed.

And that's when you can shutdown any functionality without freeing the
resources.

Imho, if you add struct gimmegimme_ops with same names as f_op you
really to mirror them as close as possible otherwise you're asking for
problems.