Re: [PATCH for-next] io_uring: fix CQE reordering

Pavel Begunkov <asml.silence@xxxxxxxxx> · Fri, 23 Sep 2022 15:26:14 +0100

On 9/23/22 15:19, Jens Axboe wrote:
On 9/23/22 7:53 AM, Pavel Begunkov wrote:
Overflowing CQEs may result in reordering, which is buggy in case of
links, F_MORE and so on.

Reported-by: Dylan Yudaken <dylany@xxxxxx>
Signed-off-by: Pavel Begunkov <asml.silence@xxxxxxxxx>
---
  io_uring/io_uring.c | 12 ++++++++++--
  io_uring/io_uring.h | 12 +++++++++---
  2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index f359e24b46c3..62d1f55fde55 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -609,7 +609,7 @@ static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
  
  	io_cq_lock(ctx);
  	while (!list_empty(&ctx->cq_overflow_list)) {
-		struct io_uring_cqe *cqe = io_get_cqe(ctx);
+		struct io_uring_cqe *cqe = io_get_cqe_overflow(ctx, true);
  		struct io_overflow_cqe *ocqe;
  
  		if (!cqe && !force)
@@ -736,12 +736,19 @@ bool io_req_cqe_overflow(struct io_kiocb *req)
   * control dependency is enough as we're using WRITE_ONCE to
   * fill the cq entry
   */
-struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx)
+struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
  {
  	struct io_rings *rings = ctx->rings;
  	unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1);
  	unsigned int free, queued, len;
  
+	/*
+	 * Posting into the CQ when there are pending overflowed CQEs may break
+	 * ordering guarantees, which will affect links, F_MORE users and more.
+	 * Force overflow the completion.
+	 */
+	if (!overflow && (ctx->check_cq & BIT(IO_CHECK_CQ_OVERFLOW_BIT)))
+		return NULL;

Rather than pass this bool around for the hot path, why not add a helper
for the case where 'overflow' isn't known? That can leave the regular
io_get_cqe() avoiding this altogether.

I was choosing between two ugly-ish solutions. io_get_cqe() should be
inline, so the extra argument shouldn't really matter, though that's only
true in theory. If someone cleans up the CQE32 part and moves it into a
separate non-inline function, it will actually get inlined.
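
To make the ordering issue and the wrapper shape concrete, here is a
minimal userspace sketch. It is a toy model, not the kernel code: every
name in it (toy_ctx, toy_get_slot(), toy_get_slot_slow(), toy_post(),
toy_flush(), toy_reap()) is hypothetical, and it detects pending overflow
by checking a small FIFO rather than the check_cq bit used in the patch.
The regular entry point is an inline wrapper that passes a constant
false, so only the flush path has to say overflow == true.

```c
/* Toy userspace model, not kernel code: all names here are hypothetical. */
#include <stdbool.h>
#include <stddef.h>

#define CQ_ENTRIES 2u	/* power of two, so the index can be masked */
#define OVF_MAX    8u	/* capacity of the toy overflow FIFO */

struct toy_ctx {
	int ring[CQ_ENTRIES];
	unsigned int head, tail;		/* free-running ring indices */
	int overflow[OVF_MAX];			/* overflowed completions, FIFO */
	unsigned int ovf_head, ovf_tail;
};

/* Slow part: 'overflow' is true only when called from the flush path. */
static int *toy_get_slot_slow(struct toy_ctx *c, bool overflow)
{
	/* Pending overflowed entries must reach the ring first. */
	if (!overflow && c->ovf_head != c->ovf_tail)
		return NULL;
	if (c->tail - c->head == CQ_ENTRIES)
		return NULL;			/* ring is full */
	return &c->ring[c->tail++ & (CQ_ENTRIES - 1)];
}

/* Regular entry point: the constant 'false' can be folded once inlined. */
static inline int *toy_get_slot(struct toy_ctx *c)
{
	return toy_get_slot_slow(c, false);
}

/* Post a completion; fall back to the overflow FIFO if no slot is given. */
static bool toy_post(struct toy_ctx *c, int val)
{
	int *slot = toy_get_slot(c);

	if (slot) {
		*slot = val;
		return true;
	}
	if (c->ovf_tail - c->ovf_head == OVF_MAX)
		return false;			/* toy model drops it */
	c->overflow[c->ovf_tail++ % OVF_MAX] = val;
	return true;
}

/* Flush: move overflowed entries into the ring, oldest first. */
static void toy_flush(struct toy_ctx *c)
{
	while (c->ovf_head != c->ovf_tail) {
		int *slot = toy_get_slot_slow(c, true);

		if (!slot)
			break;			/* ring full again, retry later */
		*slot = c->overflow[c->ovf_head++ % OVF_MAX];
	}
}

/* Consumer side: pop the oldest ring entry, or -1 if the ring is empty. */
static int toy_reap(struct toy_ctx *c)
{
	if (c->head == c->tail)
		return -1;
	return c->ring[c->head++ & (CQ_ENTRIES - 1)];
}
```

With a two-entry ring, posting 1, 2, 3 sends 3 to the overflow FIFO.
After one entry is reaped there is a free slot, but posting 4 is still
forced to overflow because 3 is pending, so the consumer sees 1, 2, 3, 4
once the FIFO is flushed. Without the check in toy_get_slot_slow(), 4
would land in the ring ahead of 3.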

--
Pavel Begunkov