On 27-Nov-24 5:58 PM, Mateusz Guzik wrote:
On Wed, Nov 27, 2024 at 1:18 PM Bharata B Rao <bharata@xxxxxxx> wrote:
On 27-Nov-24 11:49 AM, Mateusz Guzik wrote:
That is to say, at a bare minimum this needs to be benchmarked before/after
with the lock removed from the picture, like so:
diff --git a/block/fops.c b/block/fops.c
index 2d01c9007681..7f9e9e2f9081 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -534,12 +534,8 @@ const struct address_space_operations def_blk_aops = {
 static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
 {
        struct inode *bd_inode = bdev_file_inode(file);
-       loff_t retval;

-       inode_lock(bd_inode);
-       retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
-       inode_unlock(bd_inode);
-       return retval;
+       return fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
 }

 static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
To be aborted if it blows up (but I don't see why it would).
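One quick way to check that it does not blow up is to hammer SEEK_END on the
patched kernel from many threads and verify the result always matches what
BLKGETSIZE64 reports. A minimal sketch of such a check follows; the
/dev/nvme0n1 path and the thread/iteration counts are placeholders, not part
of the actual test setup discussed here:

/*
 * Hypothetical sanity check for the lockless SEEK_END path: many
 * threads call lseek(fd, 0, SEEK_END) in a loop and verify that the
 * result always matches the size reported by BLKGETSIZE64.  Device
 * path, thread and iteration counts are placeholders.
 */
#include <fcntl.h>
#include <linux/fs.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define NR_THREADS      32
#define NR_ITERS        1000000

static const char *dev = "/dev/nvme0n1";        /* placeholder device */
static uint64_t dev_size;

static void *seek_worker(void *arg)
{
        int fd = open(dev, O_RDONLY);

        if (fd < 0) {
                perror("open");
                exit(1);
        }
        for (long i = 0; i < NR_ITERS; i++) {
                off_t end = lseek(fd, 0, SEEK_END);

                if (end < 0 || (uint64_t)end != dev_size) {
                        fprintf(stderr, "SEEK_END returned %lld, expected %llu\n",
                                (long long)end, (unsigned long long)dev_size);
                        exit(1);
                }
        }
        close(fd);
        return NULL;
}

int main(int argc, char **argv)
{
        pthread_t tids[NR_THREADS];
        int fd;

        if (argc > 1)
                dev = argv[1];

        fd = open(dev, O_RDONLY);
        if (fd < 0 || ioctl(fd, BLKGETSIZE64, &dev_size) < 0) {
                perror("open/BLKGETSIZE64");
                return 1;
        }
        close(fd);

        for (int i = 0; i < NR_THREADS; i++)
                pthread_create(&tids[i], NULL, seek_worker, NULL);
        for (int i = 0; i < NR_THREADS; i++)
                pthread_join(tids[i], NULL);

        printf("all SEEK_END results matched %llu bytes\n",
               (unsigned long long)dev_size);
        return 0;
}

Build with gcc -O2 -pthread and point it at the same device the benchmark
uses.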
Thanks for this fix, will try and get back with results.
Please make sure to have results with just this change and no messing
with folio sizes, so that I have something for the patch submission.
The inode_lock contention is gone after your above change. The new
top-10 contention data now looks like this:
 contended     total wait     max wait     avg wait         type   caller

2441494015       172.15 h      1.72 ms    253.83 us     spinlock   folio_wait_bit_common+0xd5
                          0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                          0xffffffffadbf5d01  _raw_spin_lock_irq+0x51
                          0xffffffffacdd1905  folio_wait_bit_common+0xd5
                          0xffffffffacdd2d0a  filemap_get_pages+0x68a
                          0xffffffffacdd2e73  filemap_read+0x103
                          0xffffffffad1d67ba  blkdev_read_iter+0x6a
                          0xffffffffacf06937  vfs_read+0x297
                          0xffffffffacf07653  ksys_read+0x73

  25269947         1.58 h      1.72 ms    225.44 us     spinlock   folio_wake_bit+0x62
                          0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                          0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
                          0xffffffffacdcf322  folio_wake_bit+0x62
                          0xffffffffacdd2ca7  filemap_get_pages+0x627
                          0xffffffffacdd2e73  filemap_read+0x103
                          0xffffffffad1d67ba  blkdev_read_iter+0x6a
                          0xffffffffacf06937  vfs_read+0x297
                          0xffffffffacf07653  ksys_read+0x73

  44757761         1.05 h      1.55 ms     84.41 us     spinlock   folio_wake_bit+0x62
                          0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                          0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
                          0xffffffffacdcf322  folio_wake_bit+0x62
                          0xffffffffacdcf7bc  folio_end_read+0x2c
                          0xffffffffacf6d4cf  mpage_read_end_io+0x6f
                          0xffffffffad1d8abb  bio_endio+0x12b
                          0xffffffffad1f07bd  blk_mq_end_request_batch+0x12d
                          0xffffffffc05e4e9b  nvme_pci_complete_batch+0xbb

  66419830        53.00 m    685.29 us     47.88 us     spinlock   __filemap_add_folio+0x14c
                          0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                          0xffffffffadbf5d01  _raw_spin_lock_irq+0x51
                          0xffffffffacdd037c  __filemap_add_folio+0x14c
                          0xffffffffacdd072c  filemap_add_folio+0x9c
                          0xffffffffacde2a4c  page_cache_ra_unbounded+0x12c
                          0xffffffffacde31e0  page_cache_ra_order+0x340
                          0xffffffffacde33b8  page_cache_sync_ra+0x148
                          0xffffffffacdd27c5  filemap_get_pages+0x145

       532        45.51 m       5.49 s       5.13 s        mutex   bdev_release+0x69
                          0xffffffffadbef1de  __mutex_lock.constprop.0+0x17e
                          0xffffffffadbef863  __mutex_lock_slowpath+0x13
                          0xffffffffadbef8bb  mutex_lock+0x3b
                          0xffffffffad1d5249  bdev_release+0x69
                          0xffffffffad1d5921  blkdev_release+0x11
                          0xffffffffacf089f3  __fput+0xe3
                          0xffffffffacf08c9b  __fput_sync+0x1b
                          0xffffffffacefe8ed  __x64_sys_close+0x3d

 264989621        45.26 m    486.17 us     10.25 us     spinlock   raw_spin_rq_lock_nested+0x1d
                          0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                          0xffffffffadbf5c7f  _raw_spin_lock+0x3f
                          0xffffffffacb4f22d  raw_spin_rq_lock_nested+0x1d
                          0xffffffffacb5b601  _raw_spin_rq_lock_irqsave+0x21
                          0xffffffffacb6f81b  sched_balance_rq+0x60b
                          0xffffffffacb70784  sched_balance_newidle+0x1f4
                          0xffffffffacb70ae4  pick_next_task_fair+0x54
                          0xffffffffacb4e583  __pick_next_task+0x43

  30026842        26.05 m    652.85 us     52.05 us     spinlock   __folio_mark_dirty+0x2b
                          0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                          0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
                          0xffffffffacde189b  __folio_mark_dirty+0x2b
                          0xffffffffacf67503  mark_buffer_dirty+0x53
                          0xffffffffacf676e2  __block_commit_write+0x82
                          0xffffffffacf685fc  block_write_end+0x3c
                          0xffffffffacfb577e  iomap_write_end+0x13e
                          0xffffffffacfb674c  iomap_file_buffered_write+0x29c

   5085820         7.07 m      1.24 ms     83.42 us     spinlock   folio_wake_bit+0x62
                          0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                          0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
                          0xffffffffacdcf322  folio_wake_bit+0x62
                          0xffffffffacdcf7bc  folio_end_read+0x2c
                          0xffffffffacf6d4cf  mpage_read_end_io+0x6f
                          0xffffffffad1d8abb  bio_endio+0x12b
                          0xffffffffad1ee8ca  blk_update_request+0x1aa
                          0xffffffffad1ef7a4  blk_mq_end_request+0x24

   8027370         1.90 m     76.21 us     14.23 us     spinlock   tick_do_update_jiffies64+0x29
                          0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                          0xffffffffadbf5c7f  _raw_spin_lock+0x3f
                          0xffffffffacc21c69  tick_do_update_jiffies64+0x29
                          0xffffffffacc21e8c  tick_nohz_handler+0x13c
                          0xffffffffacc0a0cf  __hrtimer_run_queues+0x10f
                          0xffffffffacc0afe8  hrtimer_interrupt+0xf8
                          0xffffffffacaa8f36  __sysvec_apic_timer_interrupt+0x56
                          0xffffffffadbe3953  sysvec_apic_timer_interrupt+0x93

   4344269         1.08 m    600.67 us     14.90 us     spinlock   __filemap_add_folio+0x14c
                          0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                          0xffffffffadbf5d01  _raw_spin_lock_irq+0x51
                          0xffffffffacdd037c  __filemap_add_folio+0x14c
                          0xffffffffacdd072c  filemap_add_folio+0x9c
                          0xffffffffacde2a4c  page_cache_ra_unbounded+0x12c
                          0xffffffffacde31e0  page_cache_ra_order+0x340
                          0xffffffffacde3615  page_cache_async_ra+0x165
                          0xffffffffacdd2b9c  filemap_get_pages+0x51c
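The top stacks above correspond to plain buffered reads on the block device
(ksys_read -> vfs_read -> blkdev_read_iter -> filemap_read). For reference, a
minimal sketch of a workload that exercises that path is shown below; the
device path, block size and thread count are placeholders, not the actual
fio configuration used for the numbers in this mail:

/*
 * Hypothetical illustration only: multiple threads doing buffered
 * (page-cache) reads at random offsets of a block device, which is
 * the path shown in the stacks above.  Device path, block size and
 * thread count are placeholders, not the actual fio settings.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NR_THREADS      64
#define BLOCK_SIZE      (64 * 1024)
#define NR_READS        100000

static const char *dev = "/dev/nvme0n1";        /* placeholder device */

static void *read_worker(void *arg)
{
        unsigned int seed = (unsigned int)(unsigned long)arg;
        char *buf = malloc(BLOCK_SIZE);
        /* No O_DIRECT: reads go through the page cache (filemap_read). */
        int fd = open(dev, O_RDONLY);
        off_t nr_blocks;

        if (fd < 0 || !buf) {
                perror("open/malloc");
                exit(1);
        }
        nr_blocks = lseek(fd, 0, SEEK_END) / BLOCK_SIZE;
        if (nr_blocks <= 0)
                exit(1);

        for (long i = 0; i < NR_READS; i++) {
                off_t off = (rand_r(&seed) % nr_blocks) * (off_t)BLOCK_SIZE;

                if (pread(fd, buf, BLOCK_SIZE, off) < 0) {
                        perror("pread");
                        exit(1);
                }
        }
        free(buf);
        close(fd);
        return NULL;
}

int main(void)
{
        pthread_t tids[NR_THREADS];

        for (long i = 0; i < NR_THREADS; i++)
                pthread_create(&tids[i], NULL, read_worker, (void *)(i + 1));
        for (int i = 0; i < NR_THREADS; i++)
                pthread_join(tids[i], NULL);
        return 0;
}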
However, a point of concern is that FIO bandwidth comes down drastically
after the change, roughly a 4x-6x drop in both read and write bandwidth
for each instance:
                     default                      inode_lock-fix
rw=30%
Instance 1   r=55.7GiB/s,w=23.9GiB/s      r=9616MiB/s,w=4121MiB/s
Instance 2   r=38.5GiB/s,w=16.5GiB/s      r=8482MiB/s,w=3635MiB/s
Instance 3   r=37.5GiB/s,w=16.1GiB/s      r=8609MiB/s,w=3690MiB/s
Instance 4   r=37.4GiB/s,w=16.0GiB/s      r=8486MiB/s,w=3637MiB/s
Regards,
Bharata.