On Mon, Sep 23, 2024 at 07:26:31PM -0700, Linus Torvalds wrote:
> On Mon, 23 Sept 2024 at 17:27, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > However, the problematic workload is cold cache operations where
> > the dentry cache repeatedly misses. This places all the operational
> > concurrency directly on the inode hash as new inodes are inserted
> > into the hash. Add memory reclaim and that adds contention as it
> > removes inodes from the hash on eviction.
>
> Yeah, and then we spend all the time just adding the inodes to the
> hashes, and probably fairly seldom use them. Oh well.
>
> And I had missed the issue with PREEMPT_RT and the fact that right now
> the inode hash lock is outside the inode lock, which is problematic.

*nod*

> So it's all a bit nasty.
>
> But I also assume most of the bad issues end up mainly showing up on
> just fairly synthetic benchmarks with ramdisks, because even with a
> good SSD I suspect the IO for the cold cache would still dominate?

No, all the issues show up on consumer-level NVMe SSDs - they have
more than enough IO concurrency to cause these CPU concurrency
related problems.

Keep in mind that when it comes to doing huge amounts of IO, ramdisks
are fundamentally flawed and don't scale. That is, the IO is
synchronous memcpy() based, so it consumes CPU time and both read and
write memory bandwidth, and concurrency is limited to the number of
CPUs in the system.

With NVMe SSDs, all the data movement is asynchronous and offloaded
to hardware with DMA engines that move the data. Those DMA engines
can often handle hundreds of concurrent IOs at once. DMA sourced data
is also only written to RAM once, and there are no dependent data
reads to slow down the DMA write to RAM like there are with a data
copy streamed through a CPU.

IOWs, once the IO rates and concurrency go up, it is generally much
faster to use the CPU to program the DMA engines to move the data
than it is to move the data with the CPU itself.

The testing I did (and so the numbers in those benchmarks) was done
on 2018-era PCIe 3.0 enterprise NVMe SSDs that could do approximately
400k 4kB random read IOPS. The latest consumer PCIe 5.0 NVMe SSDs are
*way faster* than these drives when subject to highly concurrent IO
requests...

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
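
For scale, a minimal back-of-envelope sketch (plain userspace C,
assuming exactly 4kB per IO and the ~400k IOPS figure quoted above)
of the bandwidth those 2018-era drives were delivering:

#include <stdio.h>

int main(void)
{
	/* Figures quoted above: ~400k random read IOPS at 4kB per IO. */
	double iops = 400000.0;
	double io_size_bytes = 4096.0;
	double bytes_per_sec = iops * io_size_bytes;

	printf("~%.2f GB/s (%.2f GiB/s) of 4kB random read bandwidth\n",
	       bytes_per_sec / 1e9,
	       bytes_per_sec / (1024.0 * 1024.0 * 1024.0));
	return 0;
}

That works out to roughly 1.6GB/s of sustained small random reads
from a single 2018-era drive; per the note above, current consumer
PCIe 5.0 drives sustain considerably more under highly concurrent
IO.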