Re: [GIT PULL] bcachefs

Jens Axboe <axboe@xxxxxxxxx> · Wed, 28 Jun 2023 19:29:33 -0600

On 6/28/23 5:50?PM, Kent Overstreet wrote:
> On Wed, Jun 28, 2023 at 05:14:09PM -0600, Jens Axboe wrote:
>> On 6/28/23 4:55?PM, Kent Overstreet wrote:
>>>> But it's not aio (or io_uring or whatever), it's simply the fact that
>>>> doing an fput() from an exiting task (for example) will end up being
>>>> done async. And hence waiting for task exits is NOT enough to ensure
>>>> that all file references have been released.
>>>>
>>>> Since there are a variety of other reasons why a mount may be pinned and
>>>> fail to umount, perhaps it's worth considering that changing this
>>>> behavior won't buy us that much. Especially since it's been around for
>>>> more than 10 years:
>>>
>>> Because it seems that before io_uring the race was quite a bit harder to
>>> hit - I only started seeing it when things started switching over to
>>> io_uring. generic/388 used to pass reliably for me (pre backpointers),
>>> now it doesn't.
>>
>> I literally just pasted a script that hits it in one second with aio. So
>> maybe generic/388 doesn't hit it as easily, but it's surely TRIVIAL to
>> hit with aio. As demonstrated. The io_uring is not hard to bring into
>> parity on that front, here's one I posted earlier today for 6.5:
>>
>> https://lore.kernel.org/io-uring/20230628170953.952923-4-axboe@xxxxxxxxx/
>>
>> Doesn't change the fact that you can easily hit this with io_uring or
>> aio, and probably more things too (didn't look any further). Is it a
>> realistic thing outside of funky tests? Probably not really, or at least
>> if those guys hit it they'd probably have the work-around hack in place
>> in their script already.
>>
>> But the fact is that it's been around for a decade. It's somehow a lot
>> easier to hit with bcachefs than XFS, which may just be because the
>> former has a bunch of workers and this may be deferring the delayed fput
>> work more. Just hand waving.
> 
> Not sure what you're arguing here...?
> 
> We've had a long standing bug, it's recently become much easier to hit
> (for multiple reasons); we seem to be in agreement on all that. All I'm
> saying is that the existence of that bug previously is not reason to fix
> it now.

Not really arguing, just stating that it's not a huge problem as it's
not something that real world would tend to do and probably why we saw
it in a test case instead.

>>>> then we'd probably want to move that deferred fput list to the
>>>> task_struct and ensure that it gets run if the task exits rather than
>>>> have a global deferred list. Currently we have:
>>>>
>>>>
>>>> 1) If kthread or in interrupt
>>>> 	1a) add to global fput list
>>>> 2) task_work_add if not. If that fails, goto 1a.
>>>>
>>>> which would then become:
>>>>
>>>> 1) If kthread or in interrupt
>>>> 	1a) add to global fput list
>>>> 2) task_work_add if not. If that fails, we know task is existing. add to
>>>>    per-task defer list to be run at a convenient time before task has
>>>>    exited.
>>>
>>> no, it becomes:
>>>  if we're running in a user task, or if we're doing an operation on
>>>  behalf of a user task, add to the user task's deferred list: otherwise
>>>  add to global deferred list.
>>
>> And how would the "on behalf of a user task" work in terms of being
>> in_interrupt()?
> 
> I don't see any relation to in_interrupt?

Just saying that you'd now need the task passed in.

> We'd have to add a version of fput() that takes an additional
> task_struct argument, and plumb that through the aio code - kioctx
> lifetime is tied to mm_struct, not task_struct, so we'd have to add a
> ref to the task_struct to kiocb.
> 
> Which would probably be a good thing tbh, it'd let us e.g. account cpu
> time back to the original task when kiocb completion has to run out of a
> workqueue.

Might also introduce some funky dependencies. Probably not an issue it
tied to the aio_kiocb. If you go ahead with that, just make sure you
keep the task referencing out of the fput variant for users that don't
need that.

-- 
Jens Axboe