On 4/3/21 12:58 AM, Hao Xu wrote:
> On 2021/4/2 6:29 AM, Pavel Begunkov wrote:
>> On 01/04/2021 15:55, Hao Xu wrote:
>>> On 2021/4/1 6:25 PM, Pavel Begunkov wrote:
>>>> On 01/04/2021 07:53, Hao Xu wrote:
>>>>> On 2021/4/1 6:06 AM, Pavel Begunkov wrote:
>>>>>>
>>>>>>
>>>>>> On 31/03/2021 10:01, Hao Xu wrote:
>>>>>>> Now that we have multishot poll requests, one sqe can emit multiple
>>>>>>> cqes. given below example:
>>>>>>>     sqe0(multishot poll)-->sqe1-->sqe2(drain req)
>>>>>>> sqe2 is designed to issue after sqe0 and sqe1 completed, but since sqe0
>>>>>>> is a multishot poll request, sqe2 may be issued after sqe0's event
>>>>>>> triggered twice before sqe1 completed. This isn't what users leverage
>>>>>>> drain requests for.
>>>>>>> Here a simple solution is to ignore all multishot poll cqes, which means
>>>>>>> drain requests won't wait those request to be done.
>>>>>>>
>>>>>>> Signed-off-by: Hao Xu <haoxu@xxxxxxxxxxxxxxxxx>
>>>>>>> ---
>>>>>>>  fs/io_uring.c | 9 +++++++--
>>>>>>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>>> index 513096759445..cd6d44cf5940 100644
>>>>>>> --- a/fs/io_uring.c
>>>>>>> +++ b/fs/io_uring.c
>>>>>>> @@ -455,6 +455,7 @@ struct io_ring_ctx {
>>>>>>>      struct callback_head        *exit_task_work;
>>>>>>>
>>>>>>>      struct wait_queue_head      hash_wait;
>>>>>>> +    unsigned                    multishot_cqes;
>>>>>>>
>>>>>>>      /* Keep this last, we don't need it for the fast path */
>>>>>>>      struct work_struct          exit_work;
>>>>>>> @@ -1181,8 +1182,8 @@ static bool req_need_defer(struct io_kiocb *req, u32 seq)
>>>>>>>      if (unlikely(req->flags & REQ_F_IO_DRAIN)) {
>>>>>>>          struct io_ring_ctx *ctx = req->ctx;
>>>>>>>
>>>>>>> -        return seq != ctx->cached_cq_tail
>>>>>>> -                + READ_ONCE(ctx->cached_cq_overflow);
>>>>>>> +        return seq + ctx->multishot_cqes != ctx->cached_cq_tail
>>>>>>> +                + READ_ONCE(ctx->cached_cq_overflow);
>>>>>>>      }
>>>>>>>
>>>>>>>      return false;
>>>>>>> @@ -4897,6 +4898,7 @@ static bool io_poll_complete(struct io_kiocb *req, __poll_t mask, int error)
>>>>>>>  {
>>>>>>>      struct io_ring_ctx *ctx = req->ctx;
>>>>>>>      unsigned flags = IORING_CQE_F_MORE;
>>>>>>> +    bool multishot_poll = !(req->poll.events & EPOLLONESHOT);
>>>>>>>
>>>>>>>      if (!error && req->poll.canceled) {
>>>>>>>          error = -ECANCELED;
>>>>>>> @@ -4911,6 +4913,9 @@ static bool io_poll_complete(struct io_kiocb *req, __poll_t mask, int error)
>>>>>>>          req->poll.done = true;
>>>>>>>          flags = 0;
>>>>>>>      }
>>>>>>> +    if (multishot_poll)
>>>>>>> +        ctx->multishot_cqes++;
>>>>>>> +
>>>>>>
>>>>>> We need to make sure we do that only for a non-final complete, i.e.
>>>>>> not killing request, otherwise it'll double account the last one.
>>>>> Hi Pavel, I saw a killing request like iopoll_remove or async_cancel
>>>>> call io_cqring_fill_event() to create an ECANCELED cqe for the
>>>>> original poll request. So there could be cases like(even for single
>>>>> poll request):
>>>>>   (1). add poll --> cancel poll, an ECANCELED cqe.
>>>>>        1sqe:1cqe  all good
>>>>>   (2). add poll --> trigger event(queued to task_work) --> cancel poll,
>>>>>        an ECANCELED cqe --> task_work runs, another ECANCELED cqe.
>>>>>        1sqe:2cqes
>>>>
>>>> Those should emit a CQE on behalf of the request they're cancelling
>>>> only when it's definitely cancelled and not going to fill it
>>>> itself. E.g. if io_poll_cancel() found it and removed from
>>>> all the list and core's poll infra.
>>>>
>>>> At least before multi-cqe it should have been working fine.
>>>>
>>> I haven't done a test for this, but from the code logic, there could be
>>> case below:
>>>
>>> io_poll_add()                         | io_poll_remove
>>> (event happen)io_poll_wake()          | io_poll_remove_one
>>>                                       | io_poll_remove_waitqs
>>>                                       | io_cqring_fill_event(-ECANCELED)
>>>                                       |
>>> task_work run(io_poll_task_func)      |
>>> io_poll_complete()                    |
>>> req->poll.canceled is true, \         |
>>> __io_cqring_fill_event(-ECANCELED)    |
>>>
>>> two ECANCELED cqes, is there anything I missed?
>>
>> Definitely may be, but need to take a closer look
>>
> I'll do some test to test if this issue exists, and make some change if
> it does.

How about something like this? Seems pointless to have an extra variable
for this, when we already track if we're going to do more completions
for this event or not. Also places the variable where it makes the most
sense, and plenty of pad space there too.

Warning: totally untested. Would be great if you could test it, and I'm
hoping you're going to send out a v2.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index f94b32b43429..1eea4998ad9b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -423,6 +423,7 @@ struct io_ring_ctx {
 		unsigned		cq_mask;
 		atomic_t		cq_timeouts;
 		unsigned		cq_last_tm_flush;
+		unsigned		cq_extra;
 		unsigned long		cq_check_overflow;
 		struct wait_queue_head	cq_wait;
 		struct fasync_struct	*cq_fasync;
@@ -1183,8 +1184,8 @@ static bool req_need_defer(struct io_kiocb *req, u32 seq)
 	if (unlikely(req->flags & REQ_F_IO_DRAIN)) {
 		struct io_ring_ctx *ctx = req->ctx;
 
-		return seq != ctx->cached_cq_tail
-				+ READ_ONCE(ctx->cached_cq_overflow);
+		return seq + ctx->cq_extra != ctx->cached_cq_tail
+				+ READ_ONCE(ctx->cached_cq_overflow);
 	}
 
 	return false;
@@ -4894,6 +4895,9 @@ static bool io_poll_complete(struct io_kiocb *req, __poll_t mask, int error)
 		req->poll.done = true;
 		flags = 0;
 	}
+	if (flags & IORING_CQE_F_MORE)
+		ctx->cq_extra++;
+
 	io_commit_cqring(ctx);
 	return !(flags & IORING_CQE_F_MORE);
 }

-- 
Jens Axboe
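
For anyone following the accounting, here is a rough userspace sketch of
what the cq_extra check is meant to do for the
sqe0(multishot poll)-->sqe1-->sqe2(drain req) example. It is a toy model
only, not kernel code: struct model_ctx and the model_*() helpers are
hypothetical names made up for illustration, and it assumes seq is simply
the number of requests submitted ahead of the drain request. Only the
three field names are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the CQ-side counters of io_ring_ctx (hypothetical). */
struct model_ctx {
	unsigned cached_cq_tail;     /* CQEs posted so far */
	unsigned cached_cq_overflow; /* CQEs that overflowed the ring */
	unsigned cq_extra;           /* CQEs posted with IORING_CQE_F_MORE */
};

/* Post one CQE; 'more' mimics IORING_CQE_F_MORE being set on it. */
static void model_post_cqe(struct model_ctx *ctx, bool more)
{
	ctx->cached_cq_tail++;
	if (more)
		ctx->cq_extra++;
}

/* Drain check as it was before the patch. */
static bool model_need_defer_old(const struct model_ctx *ctx, unsigned seq)
{
	return seq != ctx->cached_cq_tail + ctx->cached_cq_overflow;
}

/* Drain check with non-final multishot CQEs discounted via cq_extra. */
static bool model_need_defer_new(const struct model_ctx *ctx, unsigned seq)
{
	return seq + ctx->cq_extra !=
	       ctx->cached_cq_tail + ctx->cached_cq_overflow;
}

/* Walk through sqe0(multishot poll)-->sqe1-->sqe2(drain req). */
static void model_demo(void)
{
	struct model_ctx ctx = {0};
	unsigned seq = 2; /* assumption: two requests precede the drain */

	/* The poll fires twice before sqe1 completes. */
	model_post_cqe(&ctx, true);
	model_post_cqe(&ctx, true);
	assert(!model_need_defer_old(&ctx, seq)); /* old check lets sqe2 go early */
	assert(model_need_defer_new(&ctx, seq));  /* new check still defers */

	/* sqe1 completes; the poll request is still armed. */
	model_post_cqe(&ctx, false);
	assert(model_need_defer_new(&ctx, seq));

	/* The poll posts its final CQE (no F_MORE); the drain may now issue. */
	model_post_cqe(&ctx, false);
	assert(!model_need_defer_new(&ctx, seq));
}
```

Calling model_demo() from a main() exercises the whole sequence. In this
model the drain request waits for the final poll CQE, and only the
completions carrying IORING_CQE_F_MORE are discounted, so the last
completion is not accounted twice.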