* Matthew Wilcox <willy@xxxxxxxxxxxxx> [210622 19:37]:
> Yes, this is a known issue. Here's what's happening:
>
> - The machine hits its low memory watermarks and starts trying to
>   reclaim. There's one kswapd per node, so both nodes go to work
>   trying to reclaim memory (each kswapd tries to handle the memory
>   attached to its node)
> - But all the memory is allocated to the same file, so both kswapd
>   instances try to remove the pages from the same file, and necessarily
>   contend on the same spinlock.
> - The process trying to stream the file is also trying to acquire this
>   spinlock in order to add its newly-allocated pages to the file.
>

Thank you for the detailed explanation. In this benchmark scenario,
every thread (4 per NVMe drive) uses its own file, so there are reads
from 64 files in flight at the same time. The individual files are only
20 GiB in size, so the kswapd instances must handle memory allocated to
multiple files at once, right? But both kswapd instances are then
probably contending for the same spinlocks on several of those files.

> What you can do is force the page cache to only allocate memory from the
> local node. That means this workload will only use half the memory in
> the machine, but it's a streaming workload, so that shouldn't matter?
>
> The only problem is, I'm not sure what the user interface is to make
> that happen. Here's what it looks like inside the kernel:

I repeated the benchmark and bound the fio threads to the NUMA nodes
their specific disks are connected to. I also forced the memory
allocation to be local to those NUMA nodes and confirmed that cache
allocation really only happens on half of the memory when only the
threads on one NUMA node run. I am not sure whether that is enough to
ensure that only one kswapd is actively trying to remove pages.

In both cases (only the threads on one NUMA node running, and NUMA-bound
threads on both nodes running) the throughput drop occurred once the
memory available to the workload was exhausted: half of the memory in
the first case, all of it in the second.
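For anyone who wants to reproduce the CPU side of such a binding outside of fio, here is a minimal Python sketch. It is an illustration only, not the configuration used for the benchmark above. The helper names `parse_cpulist` and `pin_to_node` are hypothetical; the sysfs path and `os.sched_setaffinity` are standard on Linux. It only pins CPUs. Restricting memory allocation to the node would additionally need something like `numactl --membind` or `set_mempolicy(2)` with `MPOL_BIND`, which is not shown here.

```python
import os


def parse_cpulist(text):
    """Parse a kernel cpulist string such as "0-15,32-47" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty list (e.g. a memory-only node)
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_node(node):
    """Restrict the calling process to the CPUs of one NUMA node (hypothetical helper)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus
```

Calling `pin_to_node(0)` before starting the reads would keep the reader on node 0's CPUs. Without a bind policy the kernel may still fall back to remote nodes once the local node fills up, which is exactly the case the memory binding is meant to rule out.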
Does that mean that it isn't the two kswapd threads contending for the
locks but the process itself and the local kswapd? Is there anything
else we could do to improve that situation?

- Philipp