On Fri, Jul 19, 2024 at 10:21 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> I can't come up with any reasonable band-aid at this moment, i.e.,
> something not too ugly to work around a more fundamental scalability
> problem.
>
> Before I give up: what type of dirty data was written back to the nvme
> device? Was it page cache or swap?

With my corporate employee hat on, I would like to note a few things.

1. there are definitely bugs here and someone(tm) should sort them
out(R), however....
2. the real goal is presumably to beat the kernel into shape where
production kernels no longer suffer lockups running this workload on
this hardware
3. the flamegraph (to be found in [1]) shows expensive debug enabled,
notably for preemption count (search for preempt_count_sub to see)
4. I'm told the lruvec problem is being worked on (but with no ETA), and
I don't think the above justifies considering any hacks or otherwise
putting more pressure on it

It is plausible that eliminating the aforementioned debug will be good
enough.

Apart from that, I note percpu_counter_add_batch (+ irq debug) accounts
for 5.8% of CPU time. This will of course go down if irq tracing is
disabled, but it so happens I optimized this routine to be faster
single-threaded (in particular by dodging the interrupt trip). The
patch is hanging out in the mm tree [2] and is trivially applicable
for testing.
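For illustration, here is a rough userspace sketch of the batching idea
behind percpu_counter_add_batch -- emphatically not the kernel code: a
thread-local variable stands in for the per-CPU slot, and counter_add,
BATCH, etc. are made-up names. The actual patch additionally dodges the
interrupt trip on the fast path; this only shows the batching half:

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

#define BATCH    32      /* fold into the shared total every BATCH updates */
#define NTHREADS 4
#define ITERS    100000

static atomic_long shared_total;      /* the globally visible count */
static thread_local long local_delta; /* per-thread (per-"CPU") slack */

static void counter_add(long amount)
{
	long pending = local_delta + amount;

	if (pending >= BATCH || pending <= -BATCH) {
		/* slow path: publish the accumulated delta */
		atomic_fetch_add_explicit(&shared_total, pending,
					  memory_order_relaxed);
		local_delta = 0;
	} else {
		/* fast path: no shared-memory traffic at all */
		local_delta = pending;
	}
}

static int worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++)
		counter_add(1);
	/* flush leftover slack so the final total is exact */
	atomic_fetch_add_explicit(&shared_total, local_delta,
				  memory_order_relaxed);
	local_delta = 0;
	return 0;
}

int main(void)
{
	thrd_t t[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		thrd_create(&t[i], worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		thrd_join(t[i], NULL);

	/* expect NTHREADS * ITERS */
	printf("total: %ld\n", atomic_load(&shared_total));
	return 0;
}

The point of the batch is the same in both worlds: updates stay local
and cheap, at the cost of readers seeing a total that is off by up to
BATCH per thread of slack.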
Even if none of the debug options can be modified, this should drop
percpu_counter_add_batch to 1.5% or so, which may or may not have the
side effect of avoiding the lockup problem.

[1]: https://lore.kernel.org/lkml/584ecb5e-b1fc-4b43-ba36-ad396d379fad@xxxxxxx/
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything&id=51d821654be4286b005ad2b7dc8b973d5008a2ec

--
Mateusz Guzik <mjguzik gmail.com>