On 25/10/2019 18:32, Jens Axboe wrote:
> On 10/25/19 3:48 AM, Pavel Begunkov wrote:
>> In case of IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL:
>>
>> The @inflight count returned by io_submit_sqes() is the number of
>> entries picked up from a sqring, including those already
>> completed/failed. And completed but not failed requests will be
>> placed into @poll_list.
>>
>> Then io_sq_thread() tries to poll @inflight events, even though failed
>> requests won't appear in @poll_list. Thus, it will think that there is
>> always something to poll (i.e. @inflight > 0).
>>
>> There are several issues with this:
>> 1. io_sq_thread() won't ever sleep
>> 2. io_sq_thread() may be left running and actively polling even after
>>    the user process is destroyed
>> 3. the same goes for mm_struct with all vmas of the user process
>>
>> TL;DR:
>> While @inflight > 0, io_sq_thread() won't put @cur_mm, thus blocking
>> recycling of the vmas used for the rings' mapping, which hold a
>> refcount on io_uring's struct file. Thus, io_uring_release() won't be
>> called, nor will kthread_{park,stop}(). That's all in the case when
>> the user process hasn't unmapped the rings.
>>
>>
>> I'm not sure how best to fix it:
>> 1. try to put failed requests into @poll_list (grabbing the mutex).
>>
>> 2. test for the zero-inflight case by comparing sq and cq, something
>> like:
>> ```
>> if (nr_polled == 0) {
>> 	lock(comp_lock);
>> 	if (cached_cq_head == cached_sq_tail)
>> 		inflight = 0;
>> 	unlock(comp_lock);
>> }
>> ```
>> But that adds extra spinlock locking in the fast path. And it's unsafe
>> to use non-cached heads/tails, as they could be maliciously changed by
>> userspace.
>>
>> 3. do some counting of failed requests (probably needs atomics or
>> other synchronisation)
>>
>> 4. something else?
>
> Can we just look at the completion count? Ala:
>
> prev_tail = ctx->cached_cq_tail;
> inflight += io_submit_sqes(ctx, to_submit, cur_mm != NULL,
> 				mm_fault);
> if (prev_tail != ctx->cached_cq_tail)
> 	inflight -= (ctx->cached_cq_tail - prev_tail);
>
> or something like that.
Don't think so:
1. @cached_cq_tail is protected by @completion_lock (right?). Never know
   what happens when you violate a memory model.
2. if something is successfully completed by that time, we again get the
   wrong number.

Basically, it's

inflight = (cached_sq_head - cached_cq_tail) + len(poll_list)

maybe you can figure out something from this.

idea 1:
How about counting failed events and subtracting them? But as they may
fail asynchronously, this needs synchronisation, e.g. atomic_add() for
failures (fail, slow-path) and atomic_load() in the kthread (fast-path).

BTW, I tested the patch below before; it fixes the issue, but is racy
for the same reason 1.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 32f6598ecae9..0353d374a0d5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2650,6 +2650,10 @@ static int io_sq_thread(void *data)
 		bool mm_fault = false;
 		unsigned int to_submit;
 
+		if (ctx->cached_sq_head == ctx->cached_cq_tail
+					   + ctx->rings->sq_dropped)
+			inflight = 0;
+
 		if (inflight) {
 			unsigned nr_events = 0;

-- 
Yours sincerely,
Pavel Begunkov