On Wed, Jun 28, 2023 at 12:26 AM Christian Brauner <brauner@xxxxxxxxxx> wrote:
>
> On Tue, Jun 27, 2023 at 08:09:46PM -0700, Suren Baghdasaryan wrote:
> > On Tue, Jun 27, 2023 at 6:54 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
> > >
> > > Hello,
> > >
> > > On Tue, Jun 27, 2023 at 02:58:08PM -0700, Suren Baghdasaryan wrote:
> > > > Ok in kernfs_generic_poll() we are using kernfs_open_node.poll
> > > > waitqueue head for polling and kernfs_open_node is freed from inside
> > > > kernfs_unlink_open_file() which is called from kernfs_fop_release().
> > > > So, it is destroyed only when the last fput() is done, unlike the
> > > > ops->release() operation which we are using for destroying PSI
> > > > trigger's waitqueue. So, it seems we still need an operation which
> > > > would indicate that the file is truly going away.
> > >
> > > If we want to stay consistent with how kernfs behaves w.r.t. severing, the
> > > right thing to do would be preventing any future polling at severing and
> > > waking up everyone currently waiting, which sounds fine from cgroup behavior
> > > POV too.
> >
> > That's actually what we are currently doing for PSI triggers.
> > ->release() is handled by cgroup_pressure_release() which signals the
> > waiters, waits for RCU grace period to pass (per
> > https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L258)
> > and then releases all the trigger resources including the waitqueue
> > head.
> > However as reported in
> > https://lore.kernel.org/all/20230613062306.101831-1-lujialin4@xxxxxxxxxx
> > this does not save us from the synchronous polling case:
> >
> > do_select
> >     vfs_poll
> >                            cgroup_pressure_release
> >                                psi_trigger_destroy
> >                                    wake_up_pollfree(&t->event_wait) -> unblocks vfs_poll
> >                                    synchronize_rcu()
> >                                    kfree(t) -> frees waitqueue head
> >     poll_freewait()
> >         -> uses waitqueue head
> >
> >
> > This happens because we release the resources associated with the file
> > while there are still file users (the file's refcount is non-zero).
> > And that happens because kernfs can call ->release() before the last
> > fput().
> >
> > >
> > > Now, the challenge is designing an interface which is difficult to make
> > > mistake with. IOW, it'd be great if kernfs wraps poll call so that severing
> > > is implemented without kernfs users doing anything, or at least make it
> > > pretty obvious what the correct usage pattern is.
> > >
> > > > Christian's suggestion to rename current ops->release() operation into
> > > > ops->drain() (or ops->flush() per Matthew's request) and introduce a
> > > > "new" ops->release() which is called only when the last fput() is done
> > > > seems sane to me. Would everyone be happy with that approach?
> > >
> > > I'm not sure I'd go there. The contract is that once ->release() is called,
> > > the code backing that file can go away (e.g. rmmod'd). It really should
> > > behave just like the last put from kernfs users' POV.
> >
> > I 100% agree with the above statement.
> >
> > > For this specific fix,
> > > it's safe because we know the ops is always built into the kernel and won't
> > > go away but it'd be really bad if the interface says "this is a normal thing
> > > to do". We'd be calling into rmmod'd text pages in no time.
> > >
> > > So, I mean, even for temporary fix, we have to make it abundantly clear that
> > > this is not for usual usage and can only be used if the code backing the ops
> > > is built into the kernel and so on.
> >
> > I think the root cause of this problem is that ->release() in kernfs
> > does not adhere to the common rule that ->release() is called only
> > when the file is going away and has no users left. Am I wrong?
>
> So imho, ultimately this all comes down to rmdir() having special
> semantics in kernfs. On any regular filesystem an rmdir() on a directory
> which is still referenced by a struct file doesn't trigger an
> f_op->release() operation. It's just that directory is unlinked and
> you get some sort of errno like ENOENT when you try to create new files
> in there or whatever. The actual f_op->release() however is triggered
> on last fput().
>
> But in essence, kernfs treats an rmdir() operation as being equivalent
> to a final fput() such that it somehow magically kills all file
> references. And that's just wrong and not supported.

Thanks for the explanation, Christian! If kernfs is special and needs
different rules for calling f_op->release(), then fine, but I need an
operation which tells me there are no users of the file so that I can
free the resources. What's the best way to do that?
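
To make the distinction concrete, here is a rough userspace model of the
lifetime rule being asked for: severing (what kernfs does at rmdir() time)
only blocks new polls and wakes waiters, while the memory is freed on the
last put. This is a sketch, not kernel code. Every name in it
(model_trigger, model_trigger_sever() and so on) is hypothetical, the
counter is a plain int rather than an atomic refcount, and there is no real
wait queue or RCU involved.

```c
/* Userspace model only: none of these names exist in the kernel. */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct model_trigger {
    int   refs;     /* stands in for the open file's reference count */
    bool  severed;  /* set when the backing node is removed (rmdir) */
    bool *freed;    /* test hook: records when the memory is released */
};

/* Create a trigger holding one reference for the opener. */
static struct model_trigger *model_trigger_create(bool *freed)
{
    struct model_trigger *t = malloc(sizeof(*t));

    if (!t)
        return NULL;
    t->refs = 1;
    t->severed = false;
    t->freed = freed;
    *freed = false;
    return t;
}

/* A poller takes a reference before it touches the wait queue. */
static void model_trigger_get(struct model_trigger *t)
{
    t->refs++;
}

/* Severing: refuse future polls (and wake waiters), but free nothing. */
static void model_trigger_sever(struct model_trigger *t)
{
    t->severed = true;
}

/* New polls are refused once the trigger has been severed. */
static bool model_trigger_can_poll(const struct model_trigger *t)
{
    return !t->severed;
}

/* Drop a reference; the last put is the only place memory is released. */
static void model_trigger_put(struct model_trigger *t)
{
    assert(t->refs > 0);
    if (--t->refs == 0) {
        *t->freed = true;
        free(t);
    }
}
```

In this model a poller that is still inside poll_freewait() holds its own
reference, so severing the trigger underneath it cannot free the wait queue
head it is about to use. Whether kernfs should expose such a "last user is
gone" callback is exactly the open question above.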