On Fri, Jul 19, 2024 at 10:21 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> I can't come up with any reasonable band-aid at this moment, i.e.,
> something not too ugly to work around a more fundamental scalability
> problem.
>
> Before I give up: what type of dirty data was written back to the nvme
> device? Was it page cache or swap?

With my corporate employee hat on, I would like to note a few things.

1. there are definitely bugs here and someone(tm) should sort them
out(R), however....
2. the real goal is presumably to beat the kernel into shape where
production kernels no longer suffer lockups running this workload on
this hardware
3. the flamegraph (to be found in [1]) shows expensive debug enabled,
notably for preemption count (search for preempt_count_sub to see)
4. I'm told the lruvec problem is being worked on (but with no ETA), and
I don't think the above justifies considering any hacks or otherwise
putting more pressure on it

It is plausible that eliminating the aforementioned debug will be good
enough.

Apart from that, I note percpu_counter_add_batch (+ irq debug) accounts
for 5.8% of CPU time. This will of course go down if irq tracing is
disabled, but it so happens I optimized this routine to be faster
single-threaded (in particular by dodging the interrupt trip). The
patch is hanging out in the mm tree [2] and is trivially applicable
for testing.
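For illustration, here is a rough userspace sketch of the batching idea
behind percpu_counter_add_batch -- emphatically not the kernel code: a
thread-local variable stands in for the per-CPU slot, and counter_add,
BATCH, etc. are made-up names. The actual patch additionally dodges the
interrupt trip on the fast path; this only shows the batching half:

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

#define BATCH    32      /* fold into the shared total every BATCH updates */
#define NTHREADS 4
#define ITERS    100000

static atomic_long shared_total;      /* the globally visible count */
static thread_local long local_delta; /* per-thread (per-"CPU") slack */

static void counter_add(long amount)
{
	long pending = local_delta + amount;

	if (pending >= BATCH || pending <= -BATCH) {
		/* slow path: publish the accumulated delta */
		atomic_fetch_add_explicit(&shared_total, pending,
					  memory_order_relaxed);
		local_delta = 0;
	} else {
		/* fast path: no shared-memory traffic at all */
		local_delta = pending;
	}
}

static int worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++)
		counter_add(1);
	/* flush leftover slack so the final total is exact */
	atomic_fetch_add_explicit(&shared_total, local_delta,
				  memory_order_relaxed);
	local_delta = 0;
	return 0;
}

int main(void)
{
	thrd_t t[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		thrd_create(&t[i], worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		thrd_join(t[i], NULL);

	/* expect NTHREADS * ITERS */
	printf("total: %ld\n", atomic_load(&shared_total));
	return 0;
}

The point of the batch is the same in both worlds: updates stay local
and cheap, at the cost of readers seeing a total that is off by up to
BATCH per thread of slack.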
Even if none of the debug options can be modified, this should drop
percpu_counter_add_batch to 1.5% or so, which may or may not have the
side effect of avoiding the lockup problem.

[1]: https://lore.kernel.org/lkml/584ecb5e-b1fc-4b43-ba36-ad396d379fad@xxxxxxx/
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything&id=51d821654be4286b005ad2b7dc8b973d5008a2ec

--
Mateusz Guzik <mjguzik gmail.com>