Recently we discussed the scalability issues seen while running large
instances of FIO with the buffered IO option on NVMe block devices here:

https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@xxxxxxx/

One of the suggestions Chris Mason gave (during private discussions) was
to enable large folios in the block buffered IO path, as that could
improve the scalability problems and reduce the lock contention. This is
an attempt to check the feasibility and potential benefit of doing so.

To keep the changes to a minimum, and to non-disruptively test this for
only the required block device, I have added an ioctl to set large folio
support on the block device mapping. I understand that this is not the
right way to do this, but this is just an experiment to evaluate the
potential benefit.

Experimental setup
------------------
2-node AMD EPYC Zen5 server with 512G memory in each node.

Disk layout for FIO:

nvme2n1     259:12   0   3.5T  0 disk
├─nvme2n1p1 259:13   0 894.3G  0 part
├─nvme2n1p2 259:14   0 894.3G  0 part
├─nvme2n1p3 259:15   0 894.3G  0 part
└─nvme2n1p4 259:16   0 894.1G  0 part

Four parallel instances of FIO are run on the above 4 partitions with
the following options:

-filename=/dev/nvme2n1p[1,2,3,4] -direct=0 -thread -size=800G
-rw=rw -rwmixwrite=[10,30,50] --norandommap --randrepeat=0
-ioengine=sync -bs=64k -numjobs=252 -runtime=3600 --time_based
-group_reporting

Results
-------
default: Unmodified kernel and FIO.
patched: Kernel with the BLKSETLFOLIO ioctl (introduced in this
patchset) and FIO modified to issue that ioctl.

In the below table, r is READ bandwidth and w is WRITE bandwidth as
reported by FIO.
                 default                    patched

ro (w/o -rw=rw option)
Instance 1       r=12.3GiB/s                r=39.4GiB/s
Instance 2       r=12.2GiB/s                r=39.1GiB/s
Instance 3       r=16.3GiB/s                r=37.1GiB/s
Instance 4       r=14.9GiB/s                r=42.9GiB/s

rwmixwrite=10%
Instance 1       r=27.5GiB/s,w=3125MiB/s    r=75.9GiB/s,w=8636MiB/s
Instance 2       r=25.5GiB/s,w=2898MiB/s    r=87.6GiB/s,w=9967MiB/s
Instance 3       r=25.7GiB/s,w=2922MiB/s    r=78.3GiB/s,w=8904MiB/s
Instance 4       r=27.5GiB/s,w=3134MiB/s    r=73.5GiB/s,w=8365MiB/s

rwmixwrite=30%
Instance 1       r=55.7GiB/s,w=23.9GiB/s    r=59.2GiB/s,w=25.4GiB/s
Instance 2       r=38.5GiB/s,w=16.5GiB/s    r=57.6GiB/s,w=24.7GiB/s
Instance 3       r=37.5GiB/s,w=16.1GiB/s    r=59.5GiB/s,w=25.5GiB/s
Instance 4       r=37.4GiB/s,w=16.0GiB/s    r=63.3GiB/s,w=27.1GiB/s

rwmixwrite=50%
Instance 1       r=37.1GiB/s,w=37.1GiB/s    r=40.7GiB/s,w=40.7GiB/s
Instance 2       r=37.6GiB/s,w=37.6GiB/s    r=45.9GiB/s,w=45.9GiB/s
Instance 3       r=35.1GiB/s,w=35.1GiB/s    r=49.2GiB/s,w=49.2GiB/s
Instance 4       r=43.6GiB/s,w=43.6GiB/s    r=41.2GiB/s,w=41.2GiB/s

Summary of FIO throughput
-------------------------
- Significant increase (~3x) in bandwidth for the ro case.
- Significant increase (~3x) in bandwidth for rwmixwrite=10%.
- Good gains (~1.15x to 1.5x) for rwmixwrite=30% and 50%.
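For reference, the patched FIO issues the new ioctl on the device fd
(presumably right after opening it) before starting I/O. A minimal
userspace sketch of that call is below; note that the BLKSETLFOLIO
command number shown here is a placeholder for illustration only - the
real definition comes from the patched include/uapi/linux/fs.h:

```c
/*
 * Illustrative sketch only: how a tool such as FIO could opt a block
 * device's page-cache mapping into large folios before starting
 * buffered IO. BLKSETLFOLIO is introduced by this patchset; the _IO()
 * encoding below is a placeholder, not the value from the actual patch.
 */
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <assert.h>

#ifndef BLKSETLFOLIO
#define BLKSETLFOLIO _IO(0x12, 143)	/* placeholder command number */
#endif

/* Returns 0 on success, -1 with errno set on failure. */
int set_large_folios(int fd)
{
	return ioctl(fd, BLKSETLFOLIO);
}
```

On a kernel without the patch (or on a non-block file), the call simply
fails with ENOTTY, so a benchmark can fall back to the default buffered
IO behaviour.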
perf-lock contention output
---------------------------
The lock contention data doesn't look all that conclusive, but for the
30% rwmixwrite case it looks like this:

perf-lock contention default

 contended    total wait    max wait    avg wait       type  caller

1337359017      64.69 h    769.04 us   174.14 us   spinlock  rwsem_wake.isra.0+0x42
			0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
			0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
			0xffffffff8f39e7d2  rwsem_wake.isra.0+0x42
			0xffffffff8f39e88f  up_write+0x4f
			0xffffffff8f9d598e  blkdev_llseek+0x4e
			0xffffffff8f703322  ksys_lseek+0x72
			0xffffffff8f7033a8  __x64_sys_lseek+0x18
			0xffffffff8f20b983  x64_sys_call+0x1fb3

   2665573      64.38 h      1.98 s     86.95 ms    rwsem:W  blkdev_llseek+0x31
			0xffffffff903f15bc  rwsem_down_write_slowpath+0x36c
			0xffffffff903f18fb  down_write+0x5b
			0xffffffff8f9d5971  blkdev_llseek+0x31
			0xffffffff8f703322  ksys_lseek+0x72
			0xffffffff8f7033a8  __x64_sys_lseek+0x18
			0xffffffff8f20b983  x64_sys_call+0x1fb3
			0xffffffff903dce5e  do_syscall_64+0x7e
			0xffffffff9040012b  entry_SYSCALL_64_after_hwframe+0x76

 134057198      14.27 h     35.93 ms   383.14 us   spinlock  clear_shadow_entries+0x57
			0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
			0xffffffff903f5c7f  _raw_spin_lock+0x3f
			0xffffffff8f5e7967  clear_shadow_entries+0x57
			0xffffffff8f5e90e3  mapping_try_invalidate+0x163
			0xffffffff8f5e9160  invalidate_mapping_pages+0x10
			0xffffffff8f9d3872  invalidate_bdev+0x42
			0xffffffff8f9fac3e  blkdev_common_ioctl+0x9ae
			0xffffffff8f9faea1  blkdev_ioctl+0xc1

  33351524       1.76 h     35.86 ms   190.43 us   spinlock  __remove_mapping+0x5d
			0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
			0xffffffff903f5c7f  _raw_spin_lock+0x3f
			0xffffffff8f5ec71d  __remove_mapping+0x5d
			0xffffffff8f5f9be6  remove_mapping+0x16
			0xffffffff8f5e8f5b  mapping_evict_folio+0x7b
			0xffffffff8f5e9068  mapping_try_invalidate+0xe8
			0xffffffff8f5e9160  invalidate_mapping_pages+0x10
			0xffffffff8f9d3872  invalidate_bdev+0x42

   9448820      14.96 m      1.54 ms    95.01 us   spinlock  folio_lruvec_lock_irqsave+0x64
			0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
			0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
			0xffffffff8f6e3ed4  folio_lruvec_lock_irqsave+0x64
			0xffffffff8f5e587c  folio_batch_move_lru+0x5c
			0xffffffff8f5e5a41  __folio_batch_add_and_move+0xd1
			0xffffffff8f5e7593  deactivate_file_folio+0x43
			0xffffffff8f5e90b7  mapping_try_invalidate+0x137
			0xffffffff8f5e9160  invalidate_mapping_pages+0x10

   1488531      11.07 m      1.07 ms   446.39 us   spinlock  try_to_free_buffers+0x56
			0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
			0xffffffff903f5c7f  _raw_spin_lock+0x3f
			0xffffffff8f768c76  try_to_free_buffers+0x56
			0xffffffff8f5cf647  filemap_release_folio+0x87
			0xffffffff8f5e8f4c  mapping_evict_folio+0x6c
			0xffffffff8f5e9068  mapping_try_invalidate+0xe8
			0xffffffff8f5e9160  invalidate_mapping_pages+0x10
			0xffffffff8f9d3872  invalidate_bdev+0x42

   2556868       6.78 m    474.72 us   159.07 us   spinlock  blkdev_llseek+0x31
			0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
			0xffffffff903f5d01  _raw_spin_lock_irq+0x51
			0xffffffff903f14c4  rwsem_down_write_slowpath+0x274
			0xffffffff903f18fb  down_write+0x5b
			0xffffffff8f9d5971  blkdev_llseek+0x31
			0xffffffff8f703322  ksys_lseek+0x72
			0xffffffff8f7033a8  __x64_sys_lseek+0x18
			0xffffffff8f20b983  x64_sys_call+0x1fb3

   2512627       3.75 m    450.96 us    89.55 us   spinlock  blkdev_llseek+0x31
			0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
			0xffffffff903f5d01  _raw_spin_lock_irq+0x51
			0xffffffff903f12f0  rwsem_down_write_slowpath+0xa0
			0xffffffff903f18fb  down_write+0x5b
			0xffffffff8f9d5971  blkdev_llseek+0x31
			0xffffffff8f703322  ksys_lseek+0x72
			0xffffffff8f7033a8  __x64_sys_lseek+0x18
			0xffffffff8f20b983  x64_sys_call+0x1fb3

    908184       1.52 m    439.58 us   100.58 us   spinlock  blkdev_llseek+0x31
			0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
			0xffffffff903f5d01  _raw_spin_lock_irq+0x51
			0xffffffff903f1367  rwsem_down_write_slowpath+0x117
			0xffffffff903f18fb  down_write+0x5b
			0xffffffff8f9d5971  blkdev_llseek+0x31
			0xffffffff8f703322  ksys_lseek+0x72
			0xffffffff8f7033a8  __x64_sys_lseek+0x18
			0xffffffff8f20b983  x64_sys_call+0x1fb3

       134       1.48 m      1.22 s    663.88 ms      mutex  bdev_release+0x69
			0xffffffff903ef1de  __mutex_lock.constprop.0+0x17e
			0xffffffff903ef863  __mutex_lock_slowpath+0x13
			0xffffffff903ef8bb  mutex_lock+0x3b
			0xffffffff8f9d5249  bdev_release+0x69
			0xffffffff8f9d5921  blkdev_release+0x11
			0xffffffff8f7089f3  __fput+0xe3
			0xffffffff8f708c9b  __fput_sync+0x1b
			0xffffffff8f6fe8ed  __x64_sys_close+0x3d

perf-lock contention patched

 contended    total wait    max wait    avg wait       type  caller

   1153627      40.15 h     48.67 s    125.30 ms    rwsem:W  blkdev_llseek+0x31
			0xffffffff903f15bc  rwsem_down_write_slowpath+0x36c
			0xffffffff903f18fb  down_write+0x5b
			0xffffffff8f9d5971  blkdev_llseek+0x31
			0xffffffff8f703322  ksys_lseek+0x72
			0xffffffff8f7033a8  __x64_sys_lseek+0x18
			0xffffffff8f20b983  x64_sys_call+0x1fb3
			0xffffffff903dce5e  do_syscall_64+0x7e
			0xffffffff9040012b  entry_SYSCALL_64_after_hwframe+0x76

 276512439      39.19 h     46.90 ms   510.22 us   spinlock  clear_shadow_entries+0x57
			0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
			0xffffffff903f5c7f  _raw_spin_lock+0x3f
			0xffffffff8f5e7967  clear_shadow_entries+0x57
			0xffffffff8f5e90e3  mapping_try_invalidate+0x163
			0xffffffff8f5e9160  invalidate_mapping_pages+0x10
			0xffffffff8f9d3872  invalidate_bdev+0x42
			0xffffffff8f9fac3e  blkdev_common_ioctl+0x9ae
			0xffffffff8f9faea1  blkdev_ioctl+0xc1

 763119320      26.37 h    887.44 us   124.38 us   spinlock  rwsem_wake.isra.0+0x42
			0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
			0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
			0xffffffff8f39e7d2  rwsem_wake.isra.0+0x42
			0xffffffff8f39e88f  up_write+0x4f
			0xffffffff8f9d598e  blkdev_llseek+0x4e
			0xffffffff8f703322  ksys_lseek+0x72
			0xffffffff8f7033a8  __x64_sys_lseek+0x18
			0xffffffff8f20b983  x64_sys_call+0x1fb3

  33263910       2.87 h     29.43 ms   310.56 us   spinlock  __remove_mapping+0x5d
			0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
			0xffffffff903f5c7f  _raw_spin_lock+0x3f
			0xffffffff8f5ec71d  __remove_mapping+0x5d
			0xffffffff8f5f9be6  remove_mapping+0x16
			0xffffffff8f5e8f5b  mapping_evict_folio+0x7b
			0xffffffff8f5e9068  mapping_try_invalidate+0xe8
			0xffffffff8f5e9160  invalidate_mapping_pages+0x10
			0xffffffff8f9d3872  invalidate_bdev+0x42

  58671816       2.50 h    519.68 us   153.45 us   spinlock  folio_lruvec_lock_irqsave+0x64
			0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
			0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
			0xffffffff8f6e3ed4  folio_lruvec_lock_irqsave+0x64
			0xffffffff8f5e587c  folio_batch_move_lru+0x5c
			0xffffffff8f5e5a41  __folio_batch_add_and_move+0xd1
			0xffffffff8f5e7593  deactivate_file_folio+0x43
			0xffffffff8f5e90b7  mapping_try_invalidate+0x137
			0xffffffff8f5e9160  invalidate_mapping_pages+0x10

       284      22.33 m      5.35 s      4.72 s       mutex  bdev_release+0x69
			0xffffffff903ef1de  __mutex_lock.constprop.0+0x17e
			0xffffffff903ef863  __mutex_lock_slowpath+0x13
			0xffffffff903ef8bb  mutex_lock+0x3b
			0xffffffff8f9d5249  bdev_release+0x69
			0xffffffff8f9d5921  blkdev_release+0x11
			0xffffffff8f7089f3  __fput+0xe3
			0xffffffff8f708c9b  __fput_sync+0x1b
			0xffffffff8f6fe8ed  __x64_sys_close+0x3d

   2181469      21.38 m      1.15 ms   587.98 us   spinlock  try_to_free_buffers+0x56
			0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
			0xffffffff903f5c7f  _raw_spin_lock+0x3f
			0xffffffff8f768c76  try_to_free_buffers+0x56
			0xffffffff8f5cf647  filemap_release_folio+0x87
			0xffffffff8f5e8f4c  mapping_evict_folio+0x6c
			0xffffffff8f5e9068  mapping_try_invalidate+0xe8
			0xffffffff8f5e9160  invalidate_mapping_pages+0x10
			0xffffffff8f9d3872  invalidate_bdev+0x42

    454398       4.22 m     37.54 ms   557.13 us   spinlock  __remove_mapping+0x5d
			0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
			0xffffffff903f5c7f  _raw_spin_lock+0x3f
			0xffffffff8f5ec71d  __remove_mapping+0x5d
			0xffffffff8f5f4f04  shrink_folio_list+0xbc4
			0xffffffff8f5f5a6b  evict_folios+0x34b
			0xffffffff8f5f772f  try_to_shrink_lruvec+0x20f
			0xffffffff8f5f79ef  shrink_one+0x10f
			0xffffffff8f5fb975  shrink_node+0xb45

       773       3.53 m      2.60 s    273.76 ms      mutex  __lru_add_drain_all+0x3a
			0xffffffff903ef1de  __mutex_lock.constprop.0+0x17e
			0xffffffff903ef863  __mutex_lock_slowpath+0x13
			0xffffffff903ef8bb  mutex_lock+0x3b
			0xffffffff8f5e3d7a  __lru_add_drain_all+0x3a
			0xffffffff8f5e77a0  lru_add_drain_all+0x10
			0xffffffff8f9d3861  invalidate_bdev+0x31
			0xffffffff8f9fac3e  blkdev_common_ioctl+0x9ae
			0xffffffff8f9faea1  blkdev_ioctl+0xc1

   1997851       3.09 m    651.65 us    92.83 us   spinlock  folio_lruvec_lock_irqsave+0x64
			0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
			0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
			0xffffffff8f6e3ed4  folio_lruvec_lock_irqsave+0x64
			0xffffffff8f5e587c  folio_batch_move_lru+0x5c
			0xffffffff8f5e5a41  __folio_batch_add_and_move+0xd1
			0xffffffff8f5e5ae4  folio_add_lru+0x54
			0xffffffff8f5d075d  filemap_add_folio+0xcd
			0xffffffff8f5e30c0  page_cache_ra_order+0x220

Observations from perf-lock contention
--------------------------------------
- Significant reduction in contention for the inode lock
  (inode->i_rwsem) from the blkdev_llseek() path.
- Significant increase in contention for inode->i_lock from the
  invalidate and remove_mapping paths.
- Significant increase in contention for the lruvec spinlock from the
  deactivate_file_folio path.

Request comments on the above. I am specifically looking for inputs on
the following:

- The lock contention results, and the usefulness of large folios in
  bringing down the contention in this specific case.
- If enabling large folios in the block buffered IO path is a feasible
  approach, inputs on doing this cleanly and correctly.

Bharata B Rao (1):
  block/ioctl: Add an ioctl to enable large folios for block buffered
    IO path

 block/ioctl.c           | 8 ++++++++
 include/uapi/linux/fs.h | 2 ++
 2 files changed, 10 insertions(+)

--
2.34.1