Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx> writes: > On 2024-02-14 17:08:44 [+0100], Toke Høiland-Jørgensen wrote: >> > During testing I forgot a spot in egress and the test module. You could >> > argue that the warning is enough since it should pop up in testing and >> > not production because the code is always missed and not by chance (go >> > boom, send a report). I *think* I covered all spots, at least the test >> > suite didn't point anything out to me. >> >> Well, I would prefer if we could make sure we covered everything and not >> have this odd failure mode where redirect just mysteriously stops >> working. At the very least, if we keep the check we should have a >> WARN_ON in there to make it really obvious that something needs to be >> fixed. > > Agree. > >> This brings me to another thing I was going to point out separately, but >> may as well mention it here: It would be good if we could keep the >> difference between the RT and !RT versions as small as possible to avoid >> having subtle bugs that only appear in one configuration. > > Yes. I do so, too. > >> I agree with Jesper that the concept of a stack-allocated "run context" >> for the XDP program makes sense in general (and I have some vague ideas >> about other things that may be useful to stick in there). So I'm >> wondering if it makes sense to do that even in the !RT case? We can't >> stick the pointer to it into 'current' when running in softirq, but we >> could change the per-cpu variable to just be a pointer that gets >> populated by xdp_storage_set()? > > I *think* that it could be added to current. The assignment currently > allows nesting so that is not a problem. Only the outer most set/clear > would do something. If you run in softirq, you would hijack a random > task_struct. If the pointer is already assigned then the list and so one > must be empty because access is only allowed in BH-disabled sections. > > However, using per-CPU for the pointer (instead of task_struct) would > have the advantage that it is always CPU/node local memory while the > random task_struct could have been allocated on a different NUMA node. Ah yes, good point, that's probably desirable :) >> I'm not really sure if this would be performance neutral (it's just >> moving around a few bits of memory, but we do gain an extra pointer >> deref), but it should be simple enough to benchmark. > > My guess is that we remain with one per-CPU dereference and an > additional "add offset". That is why I kept the !RT bits as they are > before getting yelled at. > > I could prepare something and run a specific test if you have one. The test itself is simple enough: Simply run xdp-bench (from xdp-tools[0]) in drop mode, serve some traffic to the machine and observe the difference in PPS before and after the patch. The tricky part is that the traffic actually has to stress the CPU, which means that the offered load has to be higher than what the CPU can handle. Which generally means running on high-speed NICs with small packets: a modern server CPU has no problem keeping up with a 10G link even at 64-byte packet size, so a 100G NIC is needed, or the test needs to be run on a low-powered machine. As a traffic generator, the xdp-trafficgen utility also in xdp-tools can be used, or the in-kernel pktgen, or something like T-rex or Moongen. Generally serving UDP traffic with 64-byte packets on a single port is enough to make sure the traffic is serviced by a single CPU, although some configuration may be needed to steer IRQs as well. -Toke [0] https://github.com/xdp-project/xdp-tools