[RFC 00/16] caching and SQ/CQ optimisations

Patches 1-5 optimise io_fill_cqe_req.

Patches 6-7 combine the iopoll and normal completion paths.

Patch 8 should improve CPU caching of SQ/CQ pointers.

Patch 9 adds an option to remove the SQ indirection (->sq_array). Assuming
we'll make it the default in liburing, Patch 10 optimises the remaining
check with a static_key.
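
To illustrate what the indirection costs, here is a minimal userspace sketch
of the index lookup with and without the array. It is not the kernel code:
struct sketch_ring, its fields and sketch_next_sqe_index() are hypothetical
names made up for this example, and validation of the value read from the
array is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the ring state (not io_ring_ctx). */
struct sketch_ring {
	uint32_t entries;         /* number of SQ slots, must be a power of two */
	uint32_t cached_sq_head;  /* free-running consumer index */
	const uint32_t *sq_array; /* indirection table, unused if no_sqarray */
	bool no_sqarray;          /* models IORING_SETUP_NO_SQARRAY */
};

/* Return the SQE slot index for the next submission and advance the head. */
static inline uint32_t sketch_next_sqe_index(struct sketch_ring *ring)
{
	uint32_t mask = ring->entries - 1;
	uint32_t idx;

	/* The masking below only works for power-of-two ring sizes. */
	assert(ring->entries && (ring->entries & mask) == 0);

	idx = ring->cached_sq_head++ & mask;
	if (!ring->no_sqarray)
		idx = ring->sq_array[idx]; /* extra dependent load on the hot path */
	return idx;
}
```

With the array in place every submission pays for one more dependent load
from shared memory; without it the slot index is just the masked head.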

Patches 11-15 shuffle io_ring_ctx fields.
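
The idea behind the field shuffling can be pictured with a small userspace
sketch: fields used together on a hot path share a cache line, and rarely
touched data goes to the end. Everything here is an assumption made for
illustration: struct sketch_ctx, its members and the 64-byte line size are
not the real io_ring_ctx layout.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHELINE 64 /* assumed cache line size */

/* Index of the cache line that a member starts in. */
#define SKETCH_LINE_OF(type, member) \
	(offsetof(type, member) / SKETCH_CACHELINE)

/* Hypothetical context: hot groups first, cold data banished to the end. */
struct sketch_ctx {
	/* submission-side hot fields share one line */
	alignas(SKETCH_CACHELINE) uint32_t cached_sq_head;
	uint32_t sq_entries;

	/* completion-side hot fields get a line of their own */
	alignas(SKETCH_CACHELINE) uint32_t cached_cq_tail;
	uint32_t cq_entries;

	/* rarely touched data lives at the end */
	alignas(SKETCH_CACHELINE) uint64_t setup_flags;
	char name[24];
};

/* Compile-time checks that the groups do not share a cache line. */
static_assert(SKETCH_LINE_OF(struct sketch_ctx, cached_sq_head) !=
	      SKETCH_LINE_OF(struct sketch_ctx, cached_cq_tail),
	      "submission and completion fields must not share a line");
static_assert(SKETCH_LINE_OF(struct sketch_ctx, setup_flags) >
	      SKETCH_LINE_OF(struct sketch_ctx, cq_entries),
	      "cold data must come after the hot groups");
```

Layout checks like these are cheap to keep next to the struct, so a later
field addition that pushes a hot member onto another line fails the build
instead of silently costing cache misses.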

Patch 16 inlines io_fill_cqe_req.

Testing with t/io_uring nops only for now:

                QD2     QD4     QD8     QD16    QD32
baseline:       17.3    26.6    36.4    43.7    49.4
Patches 1-15:   17.8    27.4    37.9    45.8    51.2
Patches 1-16:   17.9    28.2    39.3    47.8    54

L1 load misses decreased from 1.7% to 1.3%. I don't think that's
significant, and it will be more interesting to see how it looks
when we do actual IO.

Pavel Begunkov (16):
  io_uring: improve cqe !tracing hot path
  io_uring: cqe init hardening
  io_uring: simplify big_cqe handling
  io_uring: refactor __io_get_cqe()
  io_uring: optimise extra io_get_cqe null check
  io_uring: reorder cqring_flush and wakeups
  io_uring: merge iopoll and normal completion paths
  io_uring: compact SQ/CQ heads/tails
  io_uring: add option to remove SQ indirection
  io_uring: static_key for !IORING_SETUP_NO_SQARRAY
  io_uring: move non aligned field to the end
  io_uring: banish non-hot data to end of io_ring_ctx
  io_uring: separate task_work/waiting cache line
  io_uring: move multishot cqe cache in ctx
  io_uring: move iopoll ctx fields around
  io_uring: force inline io_fill_cqe_req

 include/linux/io_uring_types.h | 129 ++++++++++++++++----------------
 include/uapi/linux/io_uring.h  |   5 ++
 io_uring/io_uring.c            | 130 ++++++++++++++++++---------------
 io_uring/io_uring.h            |  58 +++++++--------
 io_uring/rw.c                  |  24 ++----
 io_uring/uring_cmd.c           |   5 +-
 6 files changed, 173 insertions(+), 178 deletions(-)


