Maybe I was wrong, but according to Michael's comment it looks like he
wants to check affinity_hint_set just for the speculative tx polling on
rx napi, instead of disabling it entirely.
And I'm not convinced this is really needed: the driver only provides an
affinity hint, not the actual affinity, so it is not guaranteed that the
tx and rx interrupts fire on the same vcpu.
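For concreteness, that narrower check would presumably look something
like the sketch below (untested; it assumes virtnet_poll_cleantx() is
the helper that does the speculative tx cleaning from the rx napi
handler, and uses the existing affinity_hint_set flag in struct
virtnet_info -- the rest of the poll path is elided):

        static int virtnet_poll(struct napi_struct *napi, int budget)
        {
                struct receive_queue *rq =
                        container_of(napi, struct receive_queue, napi);
                struct virtnet_info *vi = rq->vq->vdev->priv;

                /* Speculatively clean the tx ring from the rx napi handler
                 * only when the irq affinity hint was set, i.e. when tx and
                 * rx interrupts are at least expected to fire on the same
                 * vcpu.
                 */
                if (vi->affinity_hint_set)
                        virtnet_poll_cleantx(rq);

                /* rx processing, napi completion etc. unchanged */
                return virtnet_receive(rq, budget);
        }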
You're right. I made the restriction broader than the request, to really
err on the side of caution for the initial merge of napi tx. And enabling
the optimization is always a win over keeping it off, even without irq
affinity.
The cycle cost is significant without affinity regardless of whether the
optimization is used. This is not limited to napi-tx, though it is more
pronounced in that mode than without napi.
1x TCP_RR for affinity configuration {process, rx_irq, tx_irq}:
upstream:
1,1,1: 28985 Mbps, 278 Gcyc
1,0,2: 30067 Mbps, 402 Gcyc
napi tx:
1,1,1: 34492 Mbps, 269 Gcyc
1,0,2: 36527 Mbps, 537 Gcyc (!)
1,0,1: 36269 Mbps, 394 Gcyc
1,0,0: 34674 Mbps, 402 Gcyc
This is a particularly strong example. It is also representative
of most RR tests. The effect is less pronounced in other, more
stream-like, tests. 10x TCP_RR, for instance:
upstream:
1,1,1: 42267 Mbps, 301 Gcyc
1,0,2: 40663 Mbps, 445 Gcyc
napi tx:
1,1,1: 42420 Mbps, 303 Gcyc
1,0,2: 42267 Mbps, 431 Gcyc
These numbers were obtained with the virtqueue_enable_cb_delayed
optimization placed after xmit_skb, by the way. It turns out that moving
that call before xmit_skb increases 1x TCP_RR further, to ~39 Gbps, at
the cost of slightly reducing 100x TCP_RR.
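For reference, the ordering difference amounts to roughly the following
in the xmit path (sketch only; the return value of
virtqueue_enable_cb_delayed, error handling and queue stop/wake logic
are all omitted):

        /* as measured above: queue the skb first, then request a
         * delayed tx completion interrupt */
        err = xmit_skb(sq, skb);
        virtqueue_enable_cb_delayed(sq->vq);

        /* alternative ordering: request the delayed callback before
         * queuing; better 1x TCP_RR (~39 Gbps), slightly worse
         * 100x TCP_RR */
        virtqueue_enable_cb_delayed(sq->vq);
        err = xmit_skb(sq, skb);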