On 2022-09-28 18:15:46 [+0200], Jason A. Donenfeld wrote:
> Hi Sebastian,

Hi Jason,

> On Wed, Sep 28, 2022 at 02:06:45PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2022-09-27 12:42:33 [+0200], Jason A. Donenfeld wrote:
> > …
> > > This is an ordinary pattern done all over the kernel. However, Sherry
> > > noticed a 10% performance regression in qperf TCP over a 40gbps
> > > InfiniBand card. Quoting her message:
> > >
> > > > MT27500 Family [ConnectX-3] cards:
> > > > Infiniband device 'mlx4_0' port 1 status:
> > …
> > While looking at the mlx4 driver, it looks like they don't use any NAPI
> > handling in their interrupt handler, which _might_ mean that they
> > handle more than 1k interrupts a second. I'm still curious to get that
> > ACKed from Sherry's side.
>
> Are you sure about that? So far as I can tell drivers/net/ethernet/
> mellanox/mlx4 has plenty of napi_schedule/napi_enable and such. Or are
> you looking at the infiniband driver instead? I don't really know how
> these interact.

I've been looking at mlx4_msi_x_interrupt() and it appears that it
iterates over a ring buffer. I guess that mlx4_cq_completion() will
invoke mlx4_en_rx_irq(), which schedules NAPI.

> But yea, if we've got a driver not using NAPI at 40gbps that's obviously
> going to be a problem.

So I'm wondering if we get one worker a second, which kills the
performance, or if we get more than 1k interrupts in less than a second,
resulting in more wakeups within a second.

> > Jason, from random's point of view: deferring until 1k interrupts + 1sec
> > delay is not desired due to low entropy, right?
>
> Definitely || is preferable to &&.
>
> > > Rather than incur the scheduling latency from queue_work_on, we can
> > > instead switch to running on the next timer tick, on the same core. This
> > > also batches things a bit more -- once per jiffy -- which is okay now
> > > that mix_interrupt_randomness() can credit multiple bits at once.
> >
> > Hmmm.
> > Do you see higher contention on input_pool.lock? Just asking
> > because if more than one CPU invokes this timer callback at the same
> > time, then they block on the same lock.
>
> I've been doing various experiments, sending mini patches to Oracle and
> having them test this in their rig. So far, it looks like the cost of
> the body of the worker itself doesn't matter much, but rather the cost
> of the enqueueing function is key. Still investigating though.
>
> It's a bit frustrating, as all I have to work with are results from the
> tests, and no perf analysis. It'd be great if an engineer at Oracle was
> capable of tackling this interactively, but at the moment it's just me
> sending them patches. So we'll see. Getting closer though, albeit very
> slowly.

Oh boy. Okay.

> Jason

Sebastian