Re: [GIT PULL] bcachefs

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Wed, 28 Jun 2023 19:50:18 -0400

On Wed, Jun 28, 2023 at 05:14:09PM -0600, Jens Axboe wrote:
> On 6/28/23 4:55?PM, Kent Overstreet wrote:
> >> But it's not aio (or io_uring or whatever), it's simply the fact that
> >> doing an fput() from an exiting task (for example) will end up being
> >> done async. And hence waiting for task exits is NOT enough to ensure
> >> that all file references have been released.
> >>
> >> Since there are a variety of other reasons why a mount may be pinned and
> >> fail to umount, perhaps it's worth considering that changing this
> >> behavior won't buy us that much. Especially since it's been around for
> >> more than 10 years:
> > 
> > Because it seems that before io_uring the race was quite a bit harder to
> > hit - I only started seeing it when things started switching over to
> > io_uring. generic/388 used to pass reliably for me (pre backpointers),
> > now it doesn't.
> 
> I literally just pasted a script that hits it in one second with aio. So
> maybe generic/388 doesn't hit it as easily, but it's surely TRIVIAL to
> hit with aio. As demonstrated. The io_uring is not hard to bring into
> parity on that front, here's one I posted earlier today for 6.5:
> 
> https://lore.kernel.org/io-uring/20230628170953.952923-4-axboe@xxxxxxxxx/
> 
> Doesn't change the fact that you can easily hit this with io_uring or
> aio, and probably more things too (didn't look any further). Is it a
> realistic thing outside of funky tests? Probably not really, or at least
> if those guys hit it they'd probably have the work-around hack in place
> in their script already.
> 
> But the fact is that it's been around for a decade. It's somehow a lot
> easier to hit with bcachefs than XFS, which may just be because the
> former has a bunch of workers and this may be deferring the delayed fput
> work more. Just hand waving.

Not sure what you're arguing here...?

We've had a long standing bug, it's recently become much easier to hit
(for multiple reasons); we seem to be in agreement on all that. All I'm
saying is that the existence of that bug previously is not reason to fix
it now.

> >> then we'd probably want to move that deferred fput list to the
> >> task_struct and ensure that it gets run if the task exits rather than
> >> have a global deferred list. Currently we have:
> >>
> >>
> >> 1) If kthread or in interrupt
> >> 	1a) add to global fput list
> >> 2) task_work_add if not. If that fails, goto 1a.
> >>
> >> which would then become:
> >>
> >> 1) If kthread or in interrupt
> >> 	1a) add to global fput list
> >> 2) task_work_add if not. If that fails, we know task is existing. add to
> >>    per-task defer list to be run at a convenient time before task has
> >>    exited.
> > 
> > no, it becomes:
> >  if we're running in a user task, or if we're doing an operation on
> >  behalf of a user task, add to the user task's deferred list: otherwise
> >  add to global deferred list.
> 
> And how would the "on behalf of a user task" work in terms of being
> in_interrupt()?

I don't see any relation to in_interrupt?

We'd have to add a version of fput() that takes an additional
task_struct argument, and plumb that through the aio code - kioctx
lifetime is tied to mm_struct, not task_struct, so we'd have to add a
ref to the task_struct to kiocb.

Which would probably be a good thing tbh, it'd let us e.g. account cpu
time back to the original task when kiocb completion has to run out of a
workqueue.