Re: [PATCH net-next v3 00/25] io_uring zerocopy send

On 7/5/22 16:01, Pavel Begunkov wrote:

NOTE: This is not to be picked up directly due to cross-subsystem merge problems.
After finding a consensus and getting necessary acks, I'll work out merging
with Jakub and Jens.


The patchset implements io_uring zerocopy send. It works with both registered
and normal buffers; mixing is allowed but not recommended. Apart from usual
request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
the userspace when buffers are freed and can be reused (see API design below).
These notifications are delivered into io_uring's Completion Queue. The
"buffer-free" notifications are not necessarily per request: the userspace has
control over it and can explicitly attach a number of requests to a single
notification. The series also adds some internal optimisations when used with
registered buffers, like removing page referencing.

From the kernel networking perspective there are two main changes. The first
one is passing ubuf_info into the network layer from io_uring (inside of an
in-kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
caching on the io_uring side, but also helps to avoid cross-referencing
and synchronisation problems. The second part is an optional optimisation
removing page referencing for requests with registered buffers.

The benchmark is an optimised version of the selftest (see [1]), which sends a
bunch of requests in a loop and then waits for their completions. The
"zc + flush" column posts one additional "buffer-free" notification per
request, and plain "zc" doesn't post buffer notifications at all.

NIC (requests / second):
IO size | non-zc    | zc             | zc + flush
4000    | 495134    | 606420 (+22%)  | 558971 (+12%)
1500    | 551808    | 577116 (+4.5%) | 565803 (+2.5%)
1000    | 584677    | 592088 (+1.2%) | 560885 (-4%)
600     | 596292    | 598550 (+0.4%) | 555366 (-6.7%)

dummy (requests / second):
IO size | non-zc    | zc             | zc + flush
8000    | 1299916   | 2396600 (+84%) | 2224219 (+71%)
4000    | 1869230   | 2344146 (+25%) | 2170069 (+16%)
1200    | 2071617   | 2361960 (+14%) | 2203052 (+6%)
600     | 2106794   | 2381527 (+13%) | 2195295 (+4%)
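
For reference, the percentages in the tables are the change in requests per
second relative to the non-zc column. A minimal sketch of that arithmetic
(rel_change_pct() is a hypothetical helper name, not part of the series or the
benchmark):

```c
#include <assert.h>
#include <stdint.h>

/* Relative change of `val` against `base`, in percent.
 * Hypothetical helper for illustration only. */
static double rel_change_pct(uint64_t base, uint64_t val)
{
	assert(base > 0);
	return ((double)val / (double)base - 1.0) * 100.0;
}
```

E.g. rel_change_pct(495134, 606420) gives roughly 22.5, which corresponds to
the +22% shown for the 4000 byte NIC case.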

Earlier versions also showed a massive speedup compared to the msg_zerocopy
tool (see [3]), which is probably not super interesting.

An additional bunch of refcounting optimisations was omitted from the series
for simplicity, and because they don't change the picture drastically. They
will be sent as a follow-up, as will flushing optimisations closing the
performance gap between the last two columns.

Note: the series is based on net-next + for-5.20/io_uring, but as vanilla
net-next fails for me, the repo (see [2]) is on top of for-5.20/io_uring.

Links:

   liburing (benchmark + some tests):
   [1] https://github.com/isilence/liburing/tree/zc_v3

   kernel repo:
   [2] https://github.com/isilence/linux/tree/zc_v3

   RFC v1:
   [3] https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@xxxxxxxxx/

   RFC v2:
   https://lore.kernel.org/io-uring/cover.1640029579.git.asml.silence@xxxxxxxxx/

API design overview:

   The series introduces an io_uring concept of notifiers. From the userspace
   perspective it's an entity to which it can bind one or more requests and then
   request to flush it. Flushing a notifier makes it impossible to attach new
   requests to it, and instructs the notifier to post a completion once all
   requests attached to it are completed and the kernel doesn't need the buffers
   anymore.

   Notifiers are stored in notification slots, which should be registered as
   an array in io_uring. Each slot stores only one notifier at any particular
   moment. Flushing removes it from the slot and the slot automatically replaces
   it with a new notifier. All operations with notifiers are done by specifying
   the index of the slot it's currently in.

   When registering notification slots the userspace specifies a u64 tag for
   each slot, which will be copied into notification completion entries as
   cqe::user_data. cqe::res is 0 and cqe::flags holds a wrap-around u32
   sequence number counting the notifiers of a slot.
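
To make the slot and notifier semantics above concrete, here is a toy userspace
model in C. It is only a sketch of the described behaviour, not the kernel
implementation and not the uapi: all model_* names, the array sizes, and the
choice that the first notifier of a slot gets sequence number 0 are assumptions
made for illustration. It also assumes that flushing a notifier with no
requests attached posts its completion immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of the notifier/slot semantics; NOT the kernel code.
 * All model_* names and the sizes below are hypothetical. */
enum { MODEL_SLOTS = 4, MODEL_NOTIFS = 16, MODEL_CQES = 16 };

struct model_cqe {
	uint64_t user_data; /* tag given at slot registration */
	int32_t  res;       /* 0 for notifications */
	uint32_t flags;     /* per-slot notifier sequence number */
};
struct model_notif {
	uint64_t tag;
	uint32_t seq;
	uint32_t pending;   /* attached requests still using their buffers */
	bool     flushed;
	bool     in_use;
};
struct model_slot {
	uint64_t tag;
	uint32_t next_seq;  /* u32, so it wraps around naturally */
	struct model_notif *cur;
};
struct model_ring {
	struct model_slot  slots[MODEL_SLOTS];
	struct model_notif notifs[MODEL_NOTIFS];
	struct model_cqe   cq[MODEL_CQES];
	size_t             cq_len;
};

/* Take a free notifier from the pool and number it for slot `s`. */
static struct model_notif *model_new_notif(struct model_ring *r,
					   struct model_slot *s)
{
	for (size_t i = 0; i < MODEL_NOTIFS; i++) {
		struct model_notif *n = &r->notifs[i];

		if (n->in_use)
			continue;
		*n = (struct model_notif){ .tag = s->tag,
					   .seq = s->next_seq++,
					   .in_use = true };
		return n;
	}
	return NULL; /* pool exhausted */
}

/* Post the "buffer-free" CQE once flushed and no request needs buffers. */
static void model_maybe_complete(struct model_ring *r, struct model_notif *n)
{
	if (!n->flushed || n->pending)
		return;
	assert(r->cq_len < MODEL_CQES);
	r->cq[r->cq_len++] = (struct model_cqe){ .user_data = n->tag,
						 .res = 0, .flags = n->seq };
	n->in_use = false;
}

static void model_register_slot(struct model_ring *r, size_t idx, uint64_t tag)
{
	assert(idx < MODEL_SLOTS);
	r->slots[idx] = (struct model_slot){ .tag = tag };
	r->slots[idx].cur = model_new_notif(r, &r->slots[idx]);
	assert(r->slots[idx].cur);
}

/* Bind one request to whatever notifier currently sits in the slot. */
static struct model_notif *model_attach(struct model_ring *r, size_t idx)
{
	assert(idx < MODEL_SLOTS && r->slots[idx].cur);
	r->slots[idx].cur->pending++;
	return r->slots[idx].cur;
}

/* The buffers of one attached request are no longer needed. */
static void model_request_done(struct model_ring *r, struct model_notif *n)
{
	assert(n->pending > 0);
	n->pending--;
	model_maybe_complete(r, n);
}

/* Detach the notifier from the slot and install a fresh one in its place. */
static void model_flush(struct model_ring *r, size_t idx)
{
	assert(idx < MODEL_SLOTS && r->slots[idx].cur);
	struct model_notif *old = r->slots[idx].cur;

	old->flushed = true;
	r->slots[idx].cur = model_new_notif(r, &r->slots[idx]);
	assert(r->slots[idx].cur);
	model_maybe_complete(r, old);
}
```

In this model, attaching two requests to slot 0, flushing it, and then
completing both requests yields exactly one CQE carrying the slot's tag in
user_data, res == 0 and flags == 0. Requests attached after the flush land on
the replacement notifier, whose CQE will carry flags == 1.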

Changelog:

   RFC v2 -> v3:
     mem accounting for non-registered buffers
     allow mixing registered and normal requests per notifier
     notification flushing via IORING_OP_RSRC_UPDATE
     TCP support
     fix buffer indexing
     fix io-wq ->uring_lock locking
     fix bugs when mixing with MSG_ZEROCOPY
     fix managed refs bugs in skbuff.c

   RFC -> RFC v2:
     remove additional overhead for non-zc from skb_release_data()
     avoid msg propagation, hide extra bits of non-zc overhead
     task_work based "buffer free" notifications
     improve io_uring's notification refcounting
     added 5/19, (no pfmemalloc tracking)
     added 8/19 and 9/19 preventing small copies with zc
     misc small changes

Pavel Begunkov (25):
   ipv4: avoid partial copy for zc
   ipv6: avoid partial copy for zc
   skbuff: add SKBFL_DONT_ORPHAN flag
   skbuff: carry external ubuf_info in msghdr
   net: bvec specific path in zerocopy_sg_from_iter
   net: optimise bvec-based zc page referencing
   net: don't track pfmemalloc for managed frags
   skbuff: don't mix ubuf_info of different types
   ipv4/udp: support zc with managed data
   ipv6/udp: support zc with managed data
   tcp: support zc with managed data
   io_uring: add zc notification infrastructure
   io_uring: export task put
   io_uring: cache struct io_notif
   io_uring: complete notifiers in tw
   io_uring: add notification slot registration
   io_uring: wire send zc request type
   io_uring: account locked pages for non-fixed zc
   io_uring: allow to pass addr into sendzc
   io_uring: add rsrc referencing for notifiers
   io_uring: sendzc with fixed buffers
   io_uring: flush notifiers after sendzc
   io_uring: rename IORING_OP_FILES_UPDATE
   io_uring: add zc notification flush requests
   selftests/io_uring: test zerocopy send

  include/linux/io_uring_types.h                |  37 ++
  include/linux/skbuff.h                        |  59 +-
  include/linux/socket.h                        |   7 +
  include/uapi/linux/io_uring.h                 |  43 +-
  io_uring/Makefile                             |   2 +-
  io_uring/io_uring.c                           |  40 +-
  io_uring/io_uring.h                           |  21 +
  io_uring/net.c                                | 134 ++++
  io_uring/net.h                                |   4 +
  io_uring/notif.c                              | 215 +++++++
  io_uring/notif.h                              |  87 +++
  io_uring/opdef.c                              |  24 +-
  io_uring/rsrc.c                               |  55 +-
  io_uring/rsrc.h                               |  16 +-
  io_uring/tctx.h                               |  26 -
  net/compat.c                                  |   2 +
  net/core/datagram.c                           |  53 +-
  net/core/skbuff.c                             |  35 +-
  net/ipv4/ip_output.c                          |  63 +-
  net/ipv4/tcp.c                                |  52 +-
  net/ipv6/ip6_output.c                         |  62 +-
  net/socket.c                                  |   6 +
  tools/testing/selftests/net/Makefile          |   1 +
  .../selftests/net/io_uring_zerocopy_tx.c      | 605 ++++++++++++++++++
  .../selftests/net/io_uring_zerocopy_tx.sh     | 131 ++++
  25 files changed, 1652 insertions(+), 128 deletions(-)
  create mode 100644 io_uring/notif.c
  create mode 100644 io_uring/notif.h
  create mode 100644 tools/testing/selftests/net/io_uring_zerocopy_tx.c
  create mode 100755 tools/testing/selftests/net/io_uring_zerocopy_tx.sh


--
Pavel Begunkov