On 5/16/2019 5:37 AM, Magnus Karlsson wrote:
After a number of surprises and issues in the driver, here is the
first set of results. 64-byte packets at 40 Gbit/s line rate, all
results in Mpps. Note that I just used my local system and kernel
build for these numbers, so they are not performance tuned. Jesper
would likely get better results on his setup :-). An explanation
follows after the table.
                               Applications
method     cores  irqs    txpush   rxdrop   l2fwd
---------------------------------------------------------------
r-t-c        2     y       35.9     11.2     8.6
poll         2     y       34.2      9.4     8.3
r-t-c        1     y       18.1      N/A     6.2
poll         1     y       14.6      8.4     5.9
busypoll     2     y       31.9     10.5     7.9
busypoll     1     y       21.5      8.7     6.2
busypoll     1     n       22.0     10.3     7.3
r-t-c    = Run-to-completion, the mode where Rx uses no syscalls and we
           only spin on the pointers in the ring.
poll     = Use the regular syscall poll().
busypoll = Use the regular syscall poll() in busy-poll mode. The RFC I
           sent out.
cores == 2 means that softirq/ksoftirqd is on a different core from the
           application. 2 cores are consumed in total.
cores == 1 means that both softirq/ksoftirqd and the application run on
           the same core. Only 1 core is used in total.
irqs == 'y' is the normal case. irqs == 'n' means that I have created a
           new napi context with the AF_XDP queues inside that does not
           have any interrupts associated with it. No other traffic
           goes to this napi context.
N/A = This combination does not make sense, since the application never
      yields: run-to-completion uses no syscalls whatsoever. It works,
      but it crawls along in the 30 Kpps range. Creating huge rings
      would help, but I did not do that.
The applications are the ones from the xdpsock sample application in
samples/bpf/.
Some things I had to do to get these results:
* The current buffer allocation scheme in i40e, where we continuously
try to access the fill queue until we find some entries, is not
effective if we are on a single core. Instead, we try once and call
a function that sets a flag. This flag is then checked in the xsk
poll code, and if it is set we schedule napi so that it can try to
allocate some buffers from the fill ring again. Note that this flag
has to propagate all the way to user space so that the application
knows that it has to call poll(). I currently set a flag in the Rx
ring to indicate that the application should call poll() to resume
the driver (see the user-space sketch after this list). This is
similar to what io_uring in the storage subsystem does. It is not
enough to return POLLERR from poll(), as that only works when we are
actually using poll(), but I do that as well.
* Implemented Sridhar's suggestion of adding busy_loop_end callbacks
that terminate the busy poll loop if the Rx queue is empty or the Tx
queue is full. The callback is roughly the shape shown in the
busy_loop_end sketch after this list.
* There is a race in the setup code in i40e when it is used with
busy-poll. The fact that busy-poll calls the napi_busy_loop code
before interrupts have been registered and enabled seems to trigger
some bug where nothing gets transmitted. This only happens for
busy-poll. Poll and run-to-completion only enter the napi loop of
i40e via interrupts, and thus only after interrupts have been
enabled, which is the last thing done during setup. I have just
worked around it by introducing a sleep(1) in the application for
these experiments. Ugly, but it should not impact the numbers, I
believe.
* The 1 core case is sensitive to the amount of work reported as done
by the driver (see the napi poll sketch after this list). This was
not correct in the XDP code of i40e and led to bad performance. Now
it reports the correct values for Rx. Note that i40e does not honor
the napi budget on Tx and sets that to 256, and the Tx work done is
not reported back to the napi infrastructure.
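
To make the flag mechanism in the first bullet concrete, here is a
rough user-space sketch of what I have in mind. The flag bit and the
location of the shared flags word are just placeholders, not a
finished uapi:

/* User-space side of the "driver needs a kick" flag. The flag bit and
 * the location of the shared flags word are placeholders for whatever
 * the final uapi becomes; in this prototype it lives in the Rx ring.
 */
#include <poll.h>
#include <stdint.h>

#define XSK_FLAG_NEED_POLL (1U << 0)    /* placeholder bit */

/* Called from the run-to-completion Rx loop after draining the Rx ring
 * and refilling the fill ring. 'flags' points at the shared flags word
 * in the Rx ring mmap. */
static void kick_driver_if_needed(int xsk_fd, volatile uint32_t *flags)
{
        struct pollfd pfd = { .fd = xsk_fd, .events = POLLIN };

        if (*flags & XSK_FLAG_NEED_POLL)
                /* The driver failed to get fill ring entries and needs
                 * a syscall to reschedule napi; poll() with a zero
                 * timeout is a cheap way to provide that kick. */
                poll(&pfd, 1, 0);
}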
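
The busy_loop_end callback from the second bullet is roughly this
shape. The ring-check helpers are stand-ins for the real ring
accessors, so take the names with a grain of salt:

/* Sketch of the loop_end callback handed to napi_busy_loop();
 * returning true terminates the busy-poll loop. xsk_rx_queue_empty()
 * and xsk_tx_queue_full() are stand-ins for the real ring checks. */
static bool xsk_busy_loop_end(void *p, unsigned long start_time)
{
        struct xdp_sock *xs = p;

        /* Stop spinning once the Rx queue is empty or the Tx queue is
         * full, or when the busy-poll time budget has expired. */
        return xsk_rx_queue_empty(xs) || xsk_tx_queue_full(xs) ||
               busy_loop_timeout(start_time);
}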
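
And for the work reporting in the last bullet, this is the napi poll
shape I am referring to; the drv_* names are just stand-ins for the
i40e internals:

/* Sketch of the usual work accounting in a driver napi poll. What
 * matters for the single-core case is that the Rx work actually done
 * is returned, so the napi layer can decide whether to keep polling,
 * defer to ksoftirqd or re-enable interrupts. */
static int drv_napi_poll(struct napi_struct *napi, int budget)
{
        struct drv_q_vector *q =
                container_of(napi, struct drv_q_vector, napi);
        int work_done;

        drv_clean_tx_irq(q);    /* Tx uses its own budget (256 in i40e)
                                 * and is not counted against 'budget' */
        work_done = drv_clean_rx_irq(q, budget);

        if (work_done >= budget)
                return budget;  /* more Rx work: stay on the poll list */

        /* Rx drained within the budget: complete napi and re-arm
         * interrupts. */
        if (napi_complete_done(napi, work_done))
                drv_irq_enable(q);

        return work_done;
}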
Some observations:
* Cannot really explain the drop in performance for txpush when going
from 2 cores to 1. As stated before, the reporting of Tx work is not
really propagated to the napi infrastructure. Tried reporting this
in a correct manner (completely ignoring Rx for this experiment) but
the results were the same. Will dig deeper into this to screen out
any stupid mistakes.
* With the fixes above, all my driver processing is in softirq in the
1 core case; it never gets handed over to ksoftirqd. Previously, when
work was reported incorrectly, ksoftirqd did take over. I would have
liked ksoftirqd to take over, as that would have been more like a
separate kernel thread. How do I accomplish this? There might still
be some reporting problem in the driver that hinders this, but I
actually think it is more correct now.
* Looking at the current results for a single core, busy poll provides
a 40% boost for Tx but only 5% for Rx. But if I instead create a
napi context without any interrupt associated with it and drive that
from busy-poll, I get a 15% to 20% performance improvement for
Rx. Tx increases only marginally beyond its 40% improvement, as
there are few interrupts on Tx due to the completion interrupt bit
being set quite infrequently. One question I have is: what am I
breaking by creating a napi context not used by anyone else, only
AF_XDP, that does not have an interrupt associated with it?
Todo:
* Explain the drop in Tx push when going from 2 cores to 1.
* Really run a separate thread for kernel processing instead of softirq.
* What other experiments would you like to see?
Thanks for sharing the results.
For the busypoll tests, I guess you may have increased the busypoll
budget to 64.
What is the busypoll timeout you are using?
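
(For reference, I am assuming the timeout comes from the usual
busy-poll knobs, i.e. the net.core.busy_poll / net.core.busy_read
sysctls or a per-socket setting along these lines; I am not sure what
your RFC wires up for AF_XDP sockets:)

/* The usual per-socket way to set a busy-poll timeout in microseconds;
 * whether AF_XDP sockets honor SO_BUSY_POLL in the RFC is an
 * assumption on my part. */
#include <sys/socket.h>

static int set_busy_poll_usecs(int fd, unsigned int usecs)
{
        return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                          &usecs, sizeof(usecs));
}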
Can you try a test that skips calling the bpf program for queues that
are associated with an af-xdp socket? I remember seeing a significant
bump in rxdrop performance with this change.
The other overhead I saw was with the dma_sync_single calls in the driver.
Thanks
Sridhar