On 8/2/23 02:36, Xuan Zhuo wrote:
On Tue, 1 Aug 2023 08:45:10 -0700, Jakub Kicinski <kuba@xxxxxxxxxx> wrote:
On Tue, 1 Aug 2023 10:57:30 +0800 Xuan Zhuo wrote:
Do you have this working and benchmarked, or is this just an idea?
This is not just an idea. As I said, it has been used at large scale.
This is the library that applications use to access AF_XDP. We have open-sourced it.
https://gitee.com/anolis/libxudp
This is the Alibaba version of nginx. It has been open-sourced, and it supports
working with the library to use AF_XDP.
http://tengine.taobao.org/
I supported this in our kernel release, Anolis/Alinux.
Interesting!
The work was done about 2 years ago. You know, I pushed the first version to
enable AF_XDP on virtio-net about two years ago. I never thought the job would
be so difficult.
Me neither, but it is what it is.
The NIC (virtio-net) of AliYun can reach 24,000,000 PPS.
So I think there is no difference from real HW in performance.
With AF_XDP, the UDP PPS is seven times that of the kernel UDP stack.
UDP pps or QUIC pps? UDP with or without GSO?
UDP PPS without GSO.
Do you have measurements of how much it saves in real world workloads?
I'm asking mostly out of curiosity, not to question the use case.
Yes, the result is affected by the request size; we can reach 10-40%.
The smaller the request size, the lower the result.
What about io_uring zero copy w/ pre-registered buffers?
You'll get csum offload, GSO, all the normal perf features.
We tried io_uring, but it was not suitable for our scenario.
Yes, AF_XDP currently does not support csum offload or GSO.
This is indeed a small problem.
Can you say more about io_uring suitability? It can do zero copy
and recently-ish Pavel optimized it quite a bit.
First, AF_XDP is also zero-copy. We also use XDP for a few things.
And this was all about two years ago, so we are talking about io_uring as it
was two years ago.
As far as I know, io_uring still uses the kernel UDP stack, whereas AF_XDP can
skip the entire kernel stack and go directly to the driver.
So here, io_uring does not have much of an advantage.
Unfortunately I'd agree. Most of it is in the net stack. It can be
optimised to a certain extent (IMHO far more modest than 7x) but would
need extensive reworking, and I don't think I saw any appetite for that.
--
Pavel Begunkov