[CC += linux-api@xxxxxxxxxxxxxxx]

Hi Willem

This is a change to the kernel-user-space API. Please CC
linux-api@xxxxxxxxxxxxxxx on any future iterations of this patch.

Thanks,

Michael

On Wed, Feb 22, 2017 at 5:38 PM, Willem de Bruijn
<willemdebruijn.kernel@xxxxxxxxx> wrote:
> From: Willem de Bruijn <willemb@xxxxxxxxxx>
>
> RFCv2:
>
> I have received a few requests for status and rebased code of this
> feature. We have been running this code internally, discovering and
> fixing various bugs. With net-next closed, now seems like a good time
> to share an updated patchset with fixes. The rebase from RFCv1/v4.2
> was mostly straightforward: mainly iov_iter changes. Full changelog:
>
> RFC -> RFCv2:
>   - review comment: do not loop skb with zerocopy frags onto rx:
>       add skb_orphan_frags_rx to orphan even refcounted frags
>       call this in __netif_receive_skb_core, deliver_skb and tun:
>       the same as 1080e512d44d ("net: orphan frags on receive")
>   - fix: hold an explicit sk reference on each notification skb.
>       previously relied on the reference (or wmem) held by the
>       data skb that would trigger notification, but this breaks
>       on skb_orphan.
>   - fix: when aborting a send, do not inc the zerocopy counter
>       this caused gaps in the notification chain
>   - fix: in packet with SOCK_DGRAM, pull ll headers before calling
>       zerocopy_sg_from_iter
>   - fix: if sock_zerocopy_realloc does not allow coalescing,
>       do not fail, just allocate a new ubuf
>   - fix: in tcp, check return value of second allocation attempt
>   - chg: allocate notification skbs from optmem
>       to avoid affecting tcp write queue accounting (TSQ)
>   - chg: limit #locked pages (ulimit) per user instead of per process
>   - chg: grow notification ids from 16 to 32 bit
>       - pass range [lo, hi] through 32 bit fields ee_info and ee_data
>   - chg: rebased to davem-net-next on top of v4.10-rc7
>   - add: limit notification coalescing
>       sharing ubufs limits overhead, but delays notification until
>       the last packet is released, possibly unbounded. Add a cap.
>   - tests: add snd_zerocopy_lo pf_packet test
>   - tests: two bugfixes (add do_flush_tcp, ++sent not only in debug)
>
> The change to allocate notification skbuffs from optmem requires
> ensuring that net.core.optmem_max is at least a few hundred KB. To
> experiment, run
>
>   sysctl -w net.core.optmem_max=1048576
>
> The snd_zerocopy_lo benchmarks reported in the individual patches were
> rerun for RFCv2. To make them work, calls to skb_orphan_frags_rx were
> replaced with skb_orphan_frags to allow looping to local sockets. The
> netperf results below are also rerun with v2.
>
> In application load, copy avoidance shows a roughly 5% systemwide
> reduction in cycles when streaming large flows and a 4-8% reduction in
> wall clock time on early tensorflow test workloads.
>
>
> Overview (from original RFC):
>
> Add zerocopy socket sendmsg() support with flag MSG_ZEROCOPY.
> Implement the feature for TCP, UDP, RAW and packet sockets.
> This is a generalization of a previous packet socket RFC patch
>
>   http://patchwork.ozlabs.org/patch/413184/
>
> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
> creates skbuff fragments directly from these pages. On tx completion,
> it notifies the socket owner that it is safe to modify memory by
> queuing a completion notification onto the socket error queue.
>
> The kernel already implements such copy avoidance with vmsplice plus
> splice and with ubuf_info for tun and virtio. Extend the second
> with features required by TCP and others: reference counting to
> support cloning (retransmit queue) and shared fragments (GSO) and
> notification coalescing to handle corking.
>
> Notifications are queued onto the socket error queue as a range
> [N, N+m], where N is a per-socket counter incremented on each
> successful zerocopy send call.
>
> * Performance
>
> The below table shows cycles reported by perf for a netperf process
> sending a single 10 Gbps TCP_STREAM. The first three columns show
> Mcycles spent in the netperf process context. The second three columns
> show time spent systemwide (-a -C A,B) on the two cpus that run the
> process and interrupt handler. Reported is the median of at least 3
> runs. std is a standard netperf, zc uses zerocopy and % is the ratio.
> Netperf is pinned to cpu 2, network interrupts to cpu 3, rps and rfs
> are disabled and the kernel is booted with idle=halt.
>
>   NETPERF="./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size"
>
>   perf stat -e cycles $NETPERF
>   perf stat -C 2,3 -a -e cycles $NETPERF
>
>         --process cycles--      ----cpu cycles----
>            std      zc   %         std      zc   %
> 4K      27,609  11,217  41      49,217  39,175  79
> 16K     21,370   3,823  18      43,540  29,213  67
> 64K     20,557   2,312  11      42,189  26,910  64
> 256K    21,110   2,134  10      43,006  27,104  63
> 1M      20,987   1,610   8      42,759  25,931  61
>
> Perf record indicates the main source of these differences.
> Process cycles only at 1M writes (perf record; perf report -n):
>
> std:
>   Samples: 42K of event 'cycles', Event count (approx.): 21258597313
>    79.41%  33884  netperf  [kernel.kallsyms]  [k] copy_user_generic_string
>     3.27%   1396  netperf  [kernel.kallsyms]  [k] tcp_sendmsg
>     1.66%    694  netperf  [kernel.kallsyms]  [k] get_page_from_freelist
>     0.79%    325  netperf  [kernel.kallsyms]  [k] tcp_ack
>     0.43%    188  netperf  [kernel.kallsyms]  [k] __alloc_skb
>
> zc:
>   Samples: 1K of event 'cycles', Event count (approx.): 1439509124
>    30.36%    584  netperf.zerocop  [kernel.kallsyms]  [k] gup_pte_range
>    14.63%    284  netperf.zerocop  [kernel.kallsyms]  [k] __zerocopy_sg_from_iter
>     8.03%    159  netperf.zerocop  [kernel.kallsyms]  [k] skb_zerocopy_add_frags_iter
>     4.84%     96  netperf.zerocop  [kernel.kallsyms]  [k] __alloc_skb
>     3.10%     60  netperf.zerocop  [kernel.kallsyms]  [k] kmem_cache_alloc_node
>
>
> * Safety
>
> The number of pages that can be pinned on behalf of a user with
> MSG_ZEROCOPY is bound by the locked memory ulimit.
>
> While the kernel holds process memory pinned, a process cannot safely
> reuse those pages for other purposes. Packets looped onto the receive
> stack and queued to a socket can be held indefinitely. Avoid unbounded
> notification latency by restricting user pages to egress paths only.
> skb_orphan_frags_rx() will create a private copy of pages even for
> refcounted packets when these are looped, as did skb_orphan_frags for
> the original tun zerocopy implementation.
>
> Pages are not remapped read-only. Processes can modify packet contents
> while packets are in flight in the kernel path. Bytes on which kernel
> control flow depends (headers) are copied to avoid TOCTTOU attacks.
> Datapath integrity does not otherwise depend on payload, with three
> exceptions: checksums, optional sk_filter/tc u32/.. and device +
> driver logic. The effect of wrong checksums is limited to the
> misbehaving process.
> TC filters that access contents may have to be excluded by adding an
> skb_orphan_frags_rx.
>
> Processes can also safely avoid OOM conditions by bounding the number
> of bytes passed with MSG_ZEROCOPY and by removing shared pages after
> transmission from their own memory map.
>
>
> * Limitations / Known Issues
>
> - PF_INET6 is not yet supported.
> - TCP does not build max GSO packets, especially for
>   small send buffers (< 4 KB)
>
> Willem de Bruijn (12):
>   sock: allocate skbs from optmem
>   sock: skb_copy_ubufs support for compound pages
>   sock: add generic socket zerocopy
>   sock: enable sendmsg zerocopy
>   sock: sendmsg zerocopy notification coalescing
>   sock: sendmsg zerocopy ulimit
>   sock: sendmsg zerocopy limit bytes per notification
>   tcp: enable sendmsg zerocopy
>   udp: enable sendmsg zerocopy
>   raw: enable sendmsg zerocopy with IP_HDRINCL
>   packet: enable sendmsg zerocopy
>   test: add sendmsg zerocopy tests
>
>  drivers/net/tun.c                             |   2 +-
>  drivers/vhost/net.c                           |   1 +
>  include/linux/sched.h                         |   2 +-
>  include/linux/skbuff.h                        |  94 +++-
>  include/linux/socket.h                        |   1 +
>  include/net/sock.h                            |   4 +
>  include/uapi/linux/errqueue.h                 |   1 +
>  net/core/datagram.c                           |  35 +-
>  net/core/dev.c                                |   4 +-
>  net/core/skbuff.c                             | 327 ++++++++++++--
>  net/core/sock.c                               |  29 ++
>  net/ipv4/ip_output.c                          |  34 +-
>  net/ipv4/raw.c                                |  27 +-
>  net/ipv4/tcp.c                                |  37 +-
>  net/packet/af_packet.c                        |  52 ++-
>  tools/testing/selftests/net/.gitignore        |   2 +
>  tools/testing/selftests/net/Makefile          |   1 +
>  tools/testing/selftests/net/snd_zerocopy.c    | 354 +++++++++++++++
>  tools/testing/selftests/net/snd_zerocopy_lo.c | 596 ++++++++++++++++++++++++++
>  19 files changed, 1536 insertions(+), 67 deletions(-)
>  create mode 100644 tools/testing/selftests/net/snd_zerocopy.c
>  create mode 100644 tools/testing/selftests/net/snd_zerocopy_lo.c
>
> --
> 2.11.0.483.g087da7b7c-goog
>

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/