On Thu, Nov 28, 2024 at 5:22 AM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>
> On Thu, Nov 28, 2024 at 5:02 AM Bharata B Rao <bharata@xxxxxxx> wrote:
> >
> > The contention with inode_lock is gone after your above changes. The new
> > top 10 contention data looks like this now:
> >
> >   contended     total wait   max wait   avg wait       type   caller
> >
> >  2441494015       172.15 h    1.72 ms  253.83 us   spinlock   folio_wait_bit_common+0xd5
> >                  0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
> >                  0xffffffffadbf5d01  _raw_spin_lock_irq+0x51
> >                  0xffffffffacdd1905  folio_wait_bit_common+0xd5
> >                  0xffffffffacdd2d0a  filemap_get_pages+0x68a
> >                  0xffffffffacdd2e73  filemap_read+0x103
> >                  0xffffffffad1d67ba  blkdev_read_iter+0x6a
> >                  0xffffffffacf06937  vfs_read+0x297
> >                  0xffffffffacf07653  ksys_read+0x73
> >    25269947         1.58 h    1.72 ms  225.44 us   spinlock   folio_wake_bit+0x62
> >                  0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
> >                  0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
> >                  0xffffffffacdcf322  folio_wake_bit+0x62
> >                  0xffffffffacdd2ca7  filemap_get_pages+0x627
> >                  0xffffffffacdd2e73  filemap_read+0x103
> >                  0xffffffffad1d67ba  blkdev_read_iter+0x6a
> >                  0xffffffffacf06937  vfs_read+0x297
> >                  0xffffffffacf07653  ksys_read+0x73
> >    44757761         1.05 h    1.55 ms   84.41 us   spinlock   folio_wake_bit+0x62
> >                  0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
> >                  0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
> >                  0xffffffffacdcf322  folio_wake_bit+0x62
> >                  0xffffffffacdcf7bc  folio_end_read+0x2c
> >                  0xffffffffacf6d4cf  mpage_read_end_io+0x6f
> >                  0xffffffffad1d8abb  bio_endio+0x12b
> >                  0xffffffffad1f07bd  blk_mq_end_request_batch+0x12d
> >                  0xffffffffc05e4e9b  nvme_pci_complete_batch+0xbb
> [snip]
> > However a point of concern is that FIO bandwidth comes down drastically
> > after the change.
> >
>
> Nicely put :)
>
> >                       default                     inode_lock-fix
> > rw=30%
> > Instance 1    r=55.7GiB/s,w=23.9GiB/s    r=9616MiB/s,w=4121MiB/s
> > Instance 2    r=38.5GiB/s,w=16.5GiB/s    r=8482MiB/s,w=3635MiB/s
> > Instance 3    r=37.5GiB/s,w=16.1GiB/s    r=8609MiB/s,w=3690MiB/s
> > Instance 4    r=37.4GiB/s,w=16.0GiB/s    r=8486MiB/s,w=3637MiB/s
> >
>
> This means that the folio waiting stuff has poor scalability, but
> without digging into it I have no idea what can be done. The easy way
> out would be to speculatively spin before buggering off, but one would
> have to check what happens in real workloads -- presumably the lock
> owner can be off cpu for a long time (I presume there is no way to
> store the owner).
>
> The now-removed lock used rwsems, which behave better when contested,
> and was pulling contention away from folios, artificially *helping*
> performance by having the folio bottleneck be exercised less.
>
> The right thing to do in the long run is still to whack the llseek
> lock acquire, but in light of the above it can probably wait for
> better times.

Willy mentioned the folio wait queue hash table could be grown; you can
find it in mm/filemap.c (around line 1062):

#define PAGE_WAIT_TABLE_BITS 8
#define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

static wait_queue_head_t *folio_waitqueue(struct folio *folio)
{
	return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)];
}

Can you collect off cpu time?

offcputime-bpfcc -K > /tmp/out

On Debian this ships with the bpfcc-tools package.

--
Mateusz Guzik <mjguzik gmail.com>
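
For reference, "growing the table" would amount to bumping PAGE_WAIT_TABLE_BITS
so that folios hash across more wait queue heads (and thus more spinlocks).
A minimal sketch of such an experiment, assuming a plain define bump; the value
10 below is an arbitrary example for measurement, not a number suggested in this
thread:

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1062,2 +1062,3 @@
-#define PAGE_WAIT_TABLE_BITS 8
+/* 10 bits = 1024 buckets instead of 256; example value only, needs measuring. */
+#define PAGE_WAIT_TABLE_BITS 10
 #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)

folio_waitqueue() itself needs no change since it already takes the hash width
from the macro; the cost is the statically allocated folio_wait_table array
doubling in size for each extra bit.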