On 12/1/21 03:10, David Ahern wrote:
On 11/30/21 8:18 AM, Pavel Begunkov wrote:
Early proof of concept for zerocopy send via io_uring. This is just
an RFC, there are details yet to be figured out, but hope to gather
some feedback.
Benchmarking udp (65435 bytes) with a dummy net device (mtu=0xffff):
The best case io_uring=116079 MB/s vs msg_zerocopy=47421 MB/s,
or 2.44 times faster.
 № | test                                | BW (MB/s) | speedup
 1 | msg_zerocopy (non-zc)               |     18281 | 0.38
 2 | msg_zerocopy -z (baseline)          |     47421 | 1
 3 | io_uring (@flush=false, nr_reqs=1)  |     96534 | 2.03
 4 | io_uring (@flush=true,  nr_reqs=1)  |     89310 | 1.88
 5 | io_uring (@flush=false, nr_reqs=8)  |    116079 | 2.44
 6 | io_uring (@flush=true,  nr_reqs=8)  |    109722 | 2.31
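The speedup column is each bandwidth divided by the msg_zerocopy -z baseline (47421 MB/s). The listed values appear to be truncated to two decimals rather than rounded (116079 / 47421 is roughly 2.448, shown as 2.44). A minimal C sketch of that arithmetic in integer math follows; the helper name and the macro are made up for illustration and are not part of the benchmark:

```c
#include <assert.h>

/* Baseline bandwidth from row 2 of the table (msg_zerocopy -z), in MB/s. */
#define BASELINE_MBPS 47421UL

/*
 * Hypothetical helper: speedup over the baseline in hundredths,
 * truncated toward zero (e.g. 244 means 2.44x).
 */
unsigned long speedup_x100(unsigned long bw_mbps, unsigned long baseline_mbps)
{
	assert(baseline_mbps > 0);
	return bw_mbps * 100UL / baseline_mbps;
}
```

For example, speedup_x100(116079, BASELINE_MBPS) evaluates to 244, which matches the 2.44 in row 5.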
Based on selftests/.../msg_zerocopy but more limited. You can use
msg_zerocopy -r as usual for receive side.
...
Can you state the exact command lines you are running for all of the
commands? I tried this set (and commands referenced below) and my
Sure. First, for dummy I set the mtu by hand; I'm not sure it can be
done from userspace, can it? Without it __ip_append_data() falls into
the non-zerocopy path.
diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c
index f82ad7419508..5c5aeacdabd5 100644
--- a/drivers/net/dummy.c
+++ b/drivers/net/dummy.c
@@ -132,7 +132,8 @@ static void dummy_setup(struct net_device *dev)
 	eth_hw_addr_random(dev);
 	dev->min_mtu = 0;
-	dev->max_mtu = 0;
+	dev->mtu = 0xffff;
+	dev->max_mtu = 0xffff;
 }
# dummy configuration
modprobe dummy numdummies=1
ip link set dummy0 up
# force requests to <dummy_ip_addr> go through the dummy device
ip route add <dummy_ip_addr> dev dummy0
With dummy I was just sinking the traffic to the dummy device, which
was good enough for me. Omitting "taskset" and "nice":
send-zc -4 -D <dummy_ip_addr> -t 10 udp
Similarly with msg_zerocopy:
<kernel>/tools/testing/selftests/net/msg_zerocopy -4 -p 6666 -D <dummy_ip_addr> -t 10 -z udp
For loopback testing, since zerocopy is not allowed there (as Willem
explained in the original MSG_ZEROCOPY cover letter), I used a hack to
bypass the restriction:
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ebb12a7d386d..42df33b175ce 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2854,9 +2854,7 @@ static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
 /* Frags must be orphaned, even if refcounted, if skb might loop to rx path */
 static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
 {
-	if (likely(!skb_zcopy(skb)))
-		return 0;
-	return skb_copy_ubufs(skb, gfp_mask);
+	return skb_orphan_frags(skb, gfp_mask);
 }
 /**
Then I ran the two lines below in parallel and looked at the numbers
the sender shows. It was in favor of io_uring for me, but I don't
remember the exact figures. perf shows that "send-zc" spends a lot of
time receiving, so after some point I wasn't testing its performance.
msg_zerocopy -r -v -4 -t 20 udp
send-zc -4 -D 127.0.0.1 -t 10 udp
mileage varies quite a bit.
Interesting, any brief notes on the setup and the results? Dummy
or something real? io_uring doesn't show if it was really zerocopied
or not, but I assume you checked it (e.g. with perf/bpftrace).
I expected that @flush=true might be worse with real devices,
there is one spot to be patched, but apart from that and
cycles spent in a real LLD offsetting the overhead, I didn't
anticipate any problems. I'll see once I try a real device.
Also, have you run this proposed change (and with TCP) across nodes
(i.e., not just local process to local process via dummy interface)?
Not yet, I tried dummy and localhost UDP as per above, and similarly
TCP. I just need to grab a server with a proper NIC, will try it out
soon.
Benchmark:
https://github.com/isilence/liburing.git zc_v1
or this file in particular:
https://github.com/isilence/liburing/blob/zc_v1/test/send-zc.c
To run the benchmark:
```
cd <liburing_dir> && make && cd test
# ./send-zc -4 [-p <port>] [-s <payload_size>] -D <destination> udp
./send-zc -4 -D 127.0.0.1 udp
```
msg_zerocopy can be used for the server side, e.g.
```
cd <linux-kernel>/tools/testing/selftests/net && make
./msg_zerocopy -4 -r [-p <port>] [-t <sec>] udp
```
--
Pavel Begunkov