We are facing a performance issue on XFS and other filesystems running on
fast NVMe drives when reading large amounts of data through the page cache
with fio. Streaming read performance starts off near the NVMe hardware
limit until roughly one system memory's worth of data (~512 GB) has been
read. Performance then drops to about half the hardware limit and CPU load
increases significantly. Using perf, we were able to establish that most of
the CPU load is caused by spin lock contention in
native_queued_spin_lock_slowpath:

-   58,93%    58,92%  fio  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
     45,72% __libc_read
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        ksys_read
        vfs_read
        new_sync_read
        xfs_file_read_iter
        xfs_file_buffered_aio_read
      - generic_file_read_iter
         - 45,72% ondemand_readahead
            - __do_page_cache_readahead
               - 34,64% __alloc_pages_nodemask
                  - 34,34% __alloc_pages_slowpath
                     - 34,33% try_to_free_pages
                          do_try_to_free_pages
                        - shrink_node
                           - 34,33% shrink_lruvec
                              - shrink_inactive_list
                                 - 28,22% shrink_page_list
                                    - 28,10% __remove_mapping
                                       - 28,10% _raw_spin_lock_irqsave
                                            native_queued_spin_lock_slowpath
                                 + 6,10% _raw_spin_lock_irq
               + 11,09% read_pages

When direct I/O is used, hardware-level read throughput is sustained for
the entire experiment and CPU load stays low; the threads are in D state
most of the time. Very similar results are described about half-way through
the article at [1].

Is this a known issue with the page cache and high-throughput I/O? Is there
any tuning that can be applied to get around the CPU bottleneck? We have
tried disabling readahead on the drives, which led to very poor throughput
(roughly 90% lower). Various other scheduler-related tuning was tried as
well, but the results were always similar.

The experiment setup can be found below. I am happy to provide more detail
if required. If this is the wrong place to post this, please kindly let me
know.

Best regards,
Philipp

[1] https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/

Experiment setup:

CPU:            2x Intel(R) Xeon(R) Platinum 8352Y, 2.2 GHz, 32c/64t each, 512 GB memory
NVMe:           16x 1.6 TB, 8 per NUMA node
FS:             one XFS per disk, but reproducible on ext4 and ZFS
Kernel:         Linux 5.3 (SLES), but reproducible on 5.12 (SUSE Tumbleweed)
NVMe scheduler: both "none" and "mq-deadline", very similar results
fio:            4 threads per NVMe drive, 20 GiB of data per thread, ioengine=sync

Sustained read throughput, direct=1: ~52 GiB/s (~3.2 GiB/s per disk)
Sustained read throughput, direct=0: ~25 GiB/s (~1.5 GiB/s per disk)
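
For reference, the per-drive workload is roughly equivalent to the
following fio invocation (the mount point, job name and the 1 MiB block
size are illustrative; thread count, data size, I/O engine and direct flag
match the setup above):

  # repeated once per NVMe drive / per-disk XFS mount
  fio --name=nvme0 --directory=/mnt/nvme0 --ioengine=sync --rw=read \
      --bs=1M --size=20g --numjobs=4 --direct=0   # direct=1 for the O_DIRECT runs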
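
Readahead was disabled through the standard block-layer sysfs knob, e.g.
(device name is just an example; applied to each drive):

  echo 0 > /sys/block/nvme0n1/queue/read_ahead_kb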
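
The call graph shown above came from a plain system-wide perf record/report
run along these lines (exact duration and options are not critical):

  perf record -a -g -- sleep 30
  perf report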