On Thu, Nov 28, 2024 at 5:02 AM Bharata B Rao <bharata@xxxxxxx> wrote:
>
> The contention with inode_lock is gone after your above changes. The new
> top 10 contention data looks like this now:
>
>  contended   total wait     max wait     avg wait         type   caller
>
> 2441494015     172.15 h      1.72 ms    253.83 us     spinlock   folio_wait_bit_common+0xd5
>                         0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
>                         0xffffffffadbf5d01  _raw_spin_lock_irq+0x51
>                         0xffffffffacdd1905  folio_wait_bit_common+0xd5
>                         0xffffffffacdd2d0a  filemap_get_pages+0x68a
>                         0xffffffffacdd2e73  filemap_read+0x103
>                         0xffffffffad1d67ba  blkdev_read_iter+0x6a
>                         0xffffffffacf06937  vfs_read+0x297
>                         0xffffffffacf07653  ksys_read+0x73
>   25269947       1.58 h      1.72 ms    225.44 us     spinlock   folio_wake_bit+0x62
>                         0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
>                         0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
>                         0xffffffffacdcf322  folio_wake_bit+0x62
>                         0xffffffffacdd2ca7  filemap_get_pages+0x627
>                         0xffffffffacdd2e73  filemap_read+0x103
>                         0xffffffffad1d67ba  blkdev_read_iter+0x6a
>                         0xffffffffacf06937  vfs_read+0x297
>                         0xffffffffacf07653  ksys_read+0x73
>   44757761       1.05 h      1.55 ms     84.41 us     spinlock   folio_wake_bit+0x62
>                         0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
>                         0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
>                         0xffffffffacdcf322  folio_wake_bit+0x62
>                         0xffffffffacdcf7bc  folio_end_read+0x2c
>                         0xffffffffacf6d4cf  mpage_read_end_io+0x6f
>                         0xffffffffad1d8abb  bio_endio+0x12b
>                         0xffffffffad1f07bd  blk_mq_end_request_batch+0x12d
>                         0xffffffffc05e4e9b  nvme_pci_complete_batch+0xbb

[snip]

> However a point of concern is that FIO bandwidth comes down drastically
> after the change.
>

Nicely put :)

>                  default                    inode_lock-fix
> rw=30%
> Instance 1   r=55.7GiB/s,w=23.9GiB/s    r=9616MiB/s,w=4121MiB/s
> Instance 2   r=38.5GiB/s,w=16.5GiB/s    r=8482MiB/s,w=3635MiB/s
> Instance 3   r=37.5GiB/s,w=16.1GiB/s    r=8609MiB/s,w=3690MiB/s
> Instance 4   r=37.4GiB/s,w=16.0GiB/s    r=8486MiB/s,w=3637MiB/s
>

This means that the folio waiting stuff has poor scalability, but
without digging into it I have no idea what can be done.

The easy way out would be to speculatively spin before buggering off to
sleep, but one would have to check what happens in real workloads --
presumably the lock owner can be off cpu for a long time (and I presume
there is nowhere to store the owner, so the spin could not check whether
the holder is still running).

The now-removed lock used an rwsem, which behaves better when contended.
It was pulling contention away from the folios, artificially *helping*
performance by exercising the folio bottleneck less.

The right thing to do in the long run is still to whack the llseek lock
acquire, but in light of the above it can probably wait for better
times.

-- 
Mateusz Guzik <mjguzik gmail.com>
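
To make the "speculatively spin" idea above concrete, here is a rough
sketch of what a bounded spin-before-sleep fallback could look like. It
is purely illustrative and not a tested patch: folio_trylock() and
cpu_relax() are existing kernel helpers, but the wrapper name
folio_lock_spin() and the spin bound are invented here and would need
real measurement. The point of spinning first is that a waiter which
gets the bit without sleeping never touches the hashed waitqueue, so it
never takes the spinlock that shows up in the contention data above.

    /*
     * Illustrative sketch only, not a tested patch: bounded speculative
     * spinning before falling back to the sleeping slow path.  The folio
     * lock bit stores no owner, so the spin has to be iteration-bounded
     * rather than owner-aware (unlike rwsem optimistic spinning), and it
     * may just burn cycles if the holder is off cpu for a long time.
     */
    #include <linux/pagemap.h>

    #define FOLIO_LOCK_SPIN_LIMIT	128	/* made-up bound, needs tuning */

    static bool folio_lock_spin(struct folio *folio)
    {
    	int i;

    	for (i = 0; i < FOLIO_LOCK_SPIN_LIMIT; i++) {
    		if (folio_trylock(folio))
    			return true;	/* got the lock, waitqueue never touched */
    		cpu_relax();
    	}
    	return false;	/* give up, caller falls back to the sleeping path */
    }

A caller in the locking slow path could try folio_lock_spin() first and
only queue itself on the waitqueue (folio_wait_bit_common()) when the
spin fails. Whether that helps real workloads is exactly the open
question above: with no owner to check, the spin is blind, and if
holders routinely go off cpu the extra spinning is pure waste.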