On 4/8/20 11:48 AM, Jens Axboe wrote:
> On 4/8/20 11:40 AM, Oleg Nesterov wrote:
>> Jens, I am sorry. I tried to understand your explanations but I can't :/
>> Just in case, I know nothing about io_uring.
>>
>> However, I strongly believe that
>>
>>   - the "task_work_exited" check in 4/4 can't help, the kernel
>>     will crash anyway if a task-work callback runs with
>>     current->task_works == &task_work_exited.
>>
>>   - this check is not needed with the patch I sent.
>>     UNLESS io_ring_ctx_wait_and_kill() can be called by the exiting
>>     task AFTER it passes exit_task_work(), but I don't see how this
>>     is possible.
>>
>> Let's forget this problem, let's assume that task_work_run() is always safe.
>>
>> I still can not understand why io_ring_ctx_wait_and_kill() needs to call
>> task_work_run().
>>
>> On 04/07, Jens Axboe wrote:
>>>
>>> io_uring exit removes the pending poll requests, but what if (for non
>>> exit invocation), we get poll requests completing before they are torn
>>> down. Now we have task_work queued up that won't get run,
>>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> this must not be possible. If task_work is queued it will run, or we
>> have another bug.
>>
>>> because we
>>> are in the task_work handler for the __fput().
>>
>> this doesn't matter...
>>
>>> For this case, we
>>> need to run the task work.
>>
>> This is what I fail to understand :/
>
> Actually debugging this just now to attempt to get to the bottom of it.
> I'm running with Peter's "put fput work at the end at task_work_run
> time" patch (with a head == NULL check that was missing). I get a hang
> on the wait_for_completion() on io_uring exit, and if I dump the
> task_work, this is what I get:
>
> dump_work: dump cb
> cb=ffff88bff25589b8, func=ffffffff812f7310 <- io_poll_task_func()
> cb=ffff88bfdd164600, func=ffffffff812925e0 <- some __fput()
> cb=ffff88bfece13cb8, func=ffffffff812f7310 <- io_poll_task_func()
> cb=ffff88bff78393b8, func=ffffffff812b2c40
>
> and we hang because io_poll_task_func() got queued twice on this task
> _after_ we yanked the current list of work.
>
> I'm adding some more debug items to figure out why this is, just wanted
> to let you know that I'm currently looking into this and will provide
> more data when I have it.
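The dump_work() helper behind the output quoted above is Jens's local debug
code and is not shown in the thread. As a rough sketch only, assuming the
2020-era layout where pending callbacks hang off task->task_works as a singly
linked list of struct callback_head, such a helper might look like this:

#include <linux/sched.h>
#include <linux/task_work.h>
#include <linux/printk.h>

/*
 * Illustrative only: walk a task's pending task_work list and print each
 * callback, producing output of the same shape as the dump quoted above.
 * The walk takes no locks, so it can race with task_work_add() and
 * task_work_run(); that is acceptable for one-off debugging output.
 */
static void dump_work(struct task_struct *task)
{
	struct callback_head *cb;

	pr_info("dump_work: dump cb\n");
	for (cb = READ_ONCE(task->task_works); cb; cb = READ_ONCE(cb->next))
		pr_info("cb=%px, func=%pS\n", cb, cb->func);
}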
Here's some more data. I added a WARN_ON_ONCE() for task->flags & PF_EXITING
on task_work_add() success, and it triggers with the following backtrace:

[ 628.291061] RIP: 0010:__io_async_wake+0x14a/0x160
[ 628.300452] Code: 8b b8 c8 00 00 00 e8 75 df 00 00 ba 01 00 00 00 48 89 ee 48 89 c7 49 89 c6 e8 82 dd de ff e9 59 ff ff ff 0f 0b e9 52 ff ff ff <0f> 0b e9 40 ff ff ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00
[ 628.337930] RSP: 0018:ffffc9000c830a40 EFLAGS: 00010002
[ 628.348365] RAX: 0000000000000000 RBX: ffff88bfe85fc200 RCX: ffff88bff58491b8
[ 628.362610] RDX: 0000000000000001 RSI: ffff88bfe85fc2b8 RDI: ffff889fc929f000
[ 628.376855] RBP: ffff88bfe85fc2b8 R08: 00000000000000c3 R09: ffffc9000c830ad0
[ 628.391087] R10: 0000000000000000 R11: ffff889ff01000a0 R12: ffffc9000c830ad0
[ 628.405317] R13: ffff889fb405fc00 R14: ffff889fc929f000 R15: ffff88bfe85fc200
[ 628.419549] FS: 0000000000000000(0000) GS:ffff889fff5c0000(0000) knlGS:0000000000000000
[ 628.435706] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 628.447178] CR2: 00007f40a3c4b8f0 CR3: 0000000002409002 CR4: 00000000003606e0
[ 628.461427] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 628.475675] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 628.489924] Call Trace:
[ 628.494810]  <IRQ>
[ 628.498835]  ? __io_queue_sqe.part.97+0x750/0x750
[ 628.508229]  ? tcp_ack_update_rtt.isra.55+0x113/0x430
[ 628.518320]  __wake_up_common+0x71/0x140
[ 628.526152]  __wake_up_common_lock+0x7c/0xc0
[ 628.534681]  sock_def_readable+0x3c/0x70
[ 628.542516]  tcp_data_queue+0x2d9/0xb50
[ 628.550175]  tcp_rcv_established+0x1ce/0x620
[ 628.558703]  ? sk_filter_trim_cap+0x4f/0x200
[ 628.567232]  tcp_v6_do_rcv+0xbe/0x3b0
[ 628.574545]  tcp_v6_rcv+0xa8d/0xb20
[ 628.581516]  ip6_protocol_deliver_rcu+0xb4/0x450
[ 628.590736]  ip6_input_finish+0x11/0x20
[ 628.598396]  ip6_input+0xa0/0xb0
[ 628.604845]  ? tcp_v6_early_demux+0x90/0x140
[ 628.613371]  ? tcp_v6_early_demux+0xdb/0x140
[ 628.621902]  ? ip6_rcv_finish_core.isra.21+0x66/0x90
[ 628.631816]  ipv6_rcv+0xc0/0xd0
[ 628.638092]  __netif_receive_skb_one_core+0x50/0x70
[ 628.647833]  netif_receive_skb_internal+0x2f/0xa0
[ 628.657226]  napi_gro_receive+0xe7/0x150
[ 628.665068]  mlx5e_handle_rx_cqe+0x8c/0x100
[ 628.673416]  mlx5e_poll_rx_cq+0xef/0x95b
[ 628.681252]  mlx5e_napi_poll+0xe2/0x610
[ 628.688913]  net_rx_action+0x132/0x360
[ 628.696403]  __do_softirq+0xd3/0x2e6
[ 628.703545]  irq_exit+0xa5/0xb0
[ 628.709816]  do_IRQ+0x79/0xd0
[ 628.715744]  common_interrupt+0xf/0xf
[ 628.723057]  </IRQ>
[ 628.727256] RIP: 0010:cpuidle_enter_state+0xac/0x410

which means that we've successfully added the task_work while the process is
exiting. Maybe I can work around this by checking for PF_EXITING myself
instead of relying on task_work_add() finding work_exited on the list.

--
Jens Axboe
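A minimal sketch of the workaround floated above: have io_uring reject the
target task itself when PF_EXITING is set, instead of relying on
task_work_add() returning -ESRCH only once exit_task_work() has installed
work_exited. The helper name io_queue_task_work() is made up for
illustration, and the task_work_add() signature shown is the 2020-era one
taking a bool notify argument:

#include <linux/sched.h>
#include <linux/errno.h>
#include <linux/task_work.h>

/*
 * Illustrative only; io_queue_task_work() is not a real io_uring helper.
 * PF_EXITING is set in exit_signals(), well before exit_task_work()
 * installs work_exited, so this rejects queueing during the window in
 * which task_work_add() still succeeds but the work may never be run.
 * The check is racy against the exit path and is only a best-effort
 * filter on top of the -ESRCH return from task_work_add().
 */
static int io_queue_task_work(struct task_struct *task,
			      struct callback_head *work)
{
	if (unlikely(task->flags & PF_EXITING))
		return -ESRCH;

	/* 2020-era signature: the third argument is "bool notify". */
	return task_work_add(task, work, true);
}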