Re: [RFC PATCH 0/1] Large folios in block buffered IO path

On 27-Nov-24 5:58 PM, Mateusz Guzik wrote:
On Wed, Nov 27, 2024 at 1:18 PM Bharata B Rao <bharata@xxxxxxx> wrote:

On 27-Nov-24 11:49 AM, Mateusz Guzik wrote:
That is to say, at a bare minimum this needs to be benchmarked before/after
with the lock removed from the picture, like so:

diff --git a/block/fops.c b/block/fops.c
index 2d01c9007681..7f9e9e2f9081 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -534,12 +534,8 @@ const struct address_space_operations def_blk_aops = {
 static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
 {
 	struct inode *bd_inode = bdev_file_inode(file);
-	loff_t retval;
-
-	inode_lock(bd_inode);
-	retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
-	inode_unlock(bd_inode);
-	return retval;
+	return fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
 }
 
 static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,

To be aborted if it blows up (but I don't see why it would).
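[Editor's note: for context on why dropping the lock is plausible, fixed_size_llseek() computes the new offset purely from a single snapshot of the file size, so one i_size_read() suffices. The following is a simplified user-space sketch of that semantics, not the kernel implementation; it ignores details such as the lockless SEEK_CUR fast path and SEEK_DATA/SEEK_HOLE handling.]

```python
import os

def fixed_size_llseek_model(cur_pos, offset, whence, size):
    """Simplified model of fixed_size_llseek(): the result depends
    only on one snapshot of `size` (taken via i_size_read() in the
    kernel), so no inode lock is needed to keep it consistent."""
    if whence == os.SEEK_SET:
        new = offset
    elif whence == os.SEEK_CUR:
        new = cur_pos + offset
    elif whence == os.SEEK_END:
        new = size + offset
    else:
        return -22          # -EINVAL
    if new < 0 or new > size:
        return -22          # -EINVAL, out of range for a fixed-size device
    return new
```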

Thanks for this fix, will try and get back with results.


Please make sure to have results just with this change, no messing
with folio sizes so that I have something for the patch submission.

The contention on inode_lock is gone after your above change. The new top 10 contention data looks like this:

 contended   total wait     max wait     avg wait         type   caller

  2441494015    172.15 h     1.72 ms  253.83 us     spinlock   folio_wait_bit_common+0xd5
                        0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffffadbf5d01  _raw_spin_lock_irq+0x51
                        0xffffffffacdd1905  folio_wait_bit_common+0xd5
                        0xffffffffacdd2d0a  filemap_get_pages+0x68a
                        0xffffffffacdd2e73  filemap_read+0x103
                        0xffffffffad1d67ba  blkdev_read_iter+0x6a
                        0xffffffffacf06937  vfs_read+0x297
                        0xffffffffacf07653  ksys_read+0x73
    25269947      1.58 h     1.72 ms  225.44 us     spinlock   folio_wake_bit+0x62
                        0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
                        0xffffffffacdcf322  folio_wake_bit+0x62
                        0xffffffffacdd2ca7  filemap_get_pages+0x627
                        0xffffffffacdd2e73  filemap_read+0x103
                        0xffffffffad1d67ba  blkdev_read_iter+0x6a
                        0xffffffffacf06937  vfs_read+0x297
                        0xffffffffacf07653  ksys_read+0x73
    44757761      1.05 h     1.55 ms   84.41 us     spinlock   folio_wake_bit+0x62
                        0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
                        0xffffffffacdcf322  folio_wake_bit+0x62
                        0xffffffffacdcf7bc  folio_end_read+0x2c
                        0xffffffffacf6d4cf  mpage_read_end_io+0x6f
                        0xffffffffad1d8abb  bio_endio+0x12b
                        0xffffffffad1f07bd  blk_mq_end_request_batch+0x12d
                        0xffffffffc05e4e9b  nvme_pci_complete_batch+0xbb
    66419830     53.00 m   685.29 us   47.88 us     spinlock   __filemap_add_folio+0x14c
                        0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffffadbf5d01  _raw_spin_lock_irq+0x51
                        0xffffffffacdd037c  __filemap_add_folio+0x14c
                        0xffffffffacdd072c  filemap_add_folio+0x9c
                        0xffffffffacde2a4c  page_cache_ra_unbounded+0x12c
                        0xffffffffacde31e0  page_cache_ra_order+0x340
                        0xffffffffacde33b8  page_cache_sync_ra+0x148
                        0xffffffffacdd27c5  filemap_get_pages+0x145
         532     45.51 m      5.49 s     5.13 s        mutex   bdev_release+0x69
                        0xffffffffadbef1de  __mutex_lock.constprop.0+0x17e
                        0xffffffffadbef863  __mutex_lock_slowpath+0x13
                        0xffffffffadbef8bb  mutex_lock+0x3b
                        0xffffffffad1d5249  bdev_release+0x69
                        0xffffffffad1d5921  blkdev_release+0x11
                        0xffffffffacf089f3  __fput+0xe3
                        0xffffffffacf08c9b  __fput_sync+0x1b
                        0xffffffffacefe8ed  __x64_sys_close+0x3d
   264989621     45.26 m   486.17 us   10.25 us     spinlock   raw_spin_rq_lock_nested+0x1d
                        0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffffadbf5c7f  _raw_spin_lock+0x3f
                        0xffffffffacb4f22d  raw_spin_rq_lock_nested+0x1d
                        0xffffffffacb5b601  _raw_spin_rq_lock_irqsave+0x21
                        0xffffffffacb6f81b  sched_balance_rq+0x60b
                        0xffffffffacb70784  sched_balance_newidle+0x1f4
                        0xffffffffacb70ae4  pick_next_task_fair+0x54
                        0xffffffffacb4e583  __pick_next_task+0x43
    30026842     26.05 m   652.85 us   52.05 us     spinlock   __folio_mark_dirty+0x2b
                        0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
                        0xffffffffacde189b  __folio_mark_dirty+0x2b
                        0xffffffffacf67503  mark_buffer_dirty+0x53
                        0xffffffffacf676e2  __block_commit_write+0x82
                        0xffffffffacf685fc  block_write_end+0x3c
                        0xffffffffacfb577e  iomap_write_end+0x13e
                        0xffffffffacfb674c  iomap_file_buffered_write+0x29c
     5085820      7.07 m     1.24 ms   83.42 us     spinlock   folio_wake_bit+0x62
                        0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
                        0xffffffffacdcf322  folio_wake_bit+0x62
                        0xffffffffacdcf7bc  folio_end_read+0x2c
                        0xffffffffacf6d4cf  mpage_read_end_io+0x6f
                        0xffffffffad1d8abb  bio_endio+0x12b
                        0xffffffffad1ee8ca  blk_update_request+0x1aa
                        0xffffffffad1ef7a4  blk_mq_end_request+0x24
     8027370      1.90 m    76.21 us   14.23 us     spinlock   tick_do_update_jiffies64+0x29
                        0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffffadbf5c7f  _raw_spin_lock+0x3f
                        0xffffffffacc21c69  tick_do_update_jiffies64+0x29
                        0xffffffffacc21e8c  tick_nohz_handler+0x13c
                        0xffffffffacc0a0cf  __hrtimer_run_queues+0x10f
                        0xffffffffacc0afe8  hrtimer_interrupt+0xf8
                        0xffffffffacaa8f36  __sysvec_apic_timer_interrupt+0x56
                        0xffffffffadbe3953  sysvec_apic_timer_interrupt+0x93
     4344269      1.08 m   600.67 us   14.90 us     spinlock   __filemap_add_folio+0x14c
                        0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffffadbf5d01  _raw_spin_lock_irq+0x51
                        0xffffffffacdd037c  __filemap_add_folio+0x14c
                        0xffffffffacdd072c  filemap_add_folio+0x9c
                        0xffffffffacde2a4c  page_cache_ra_unbounded+0x12c
                        0xffffffffacde31e0  page_cache_ra_order+0x340
                        0xffffffffacde3615  page_cache_async_ra+0x165
                        0xffffffffacdd2b9c  filemap_get_pages+0x51c
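[Editor's note: as a sanity check on the columns above, "avg wait" is "total wait" divided by the "contended" count. Verifying this for the top folio_wait_bit_common entry, using the figures from the table:]

```python
# Check avg wait = total wait / contended for the top entry above.
contended = 2_441_494_015            # contention count
total_wait_s = 172.15 * 3600         # 172.15 h in seconds

avg_wait_us = total_wait_s * 1e6 / contended
print(f"avg wait = {avg_wait_us:.2f} us")  # ~253.84, matching the reported 253.83 us
```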

However, a point of concern is that FIO bandwidth drops drastically after the change.

		default				inode_lock-fix
rw=30%
Instance 1	r=55.7GiB/s,w=23.9GiB/s		r=9616MiB/s,w=4121MiB/s
Instance 2	r=38.5GiB/s,w=16.5GiB/s		r=8482MiB/s,w=3635MiB/s
Instance 3	r=37.5GiB/s,w=16.1GiB/s		r=8609MiB/s,w=3690MiB/s
Instance 4	r=37.4GiB/s,w=16.0GiB/s		r=8486MiB/s,w=3637MiB/s
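[Editor's note: the two columns use different units (GiB/s vs MiB/s). Converting, the read-bandwidth drop works out to roughly 4.5-6x; a quick sketch of the arithmetic, using the numbers from the table above:]

```python
# Read-bandwidth comparison, default vs inode_lock-fix (figures from the table).
MIB_PER_GIB = 1024

default_gib = {1: 55.7, 2: 38.5, 3: 37.5, 4: 37.4}   # r= values, GiB/s
fixed_mib   = {1: 9616, 2: 8482, 3: 8609, 4: 8486}   # r= values, MiB/s

for inst in default_gib:
    fixed_gib = fixed_mib[inst] / MIB_PER_GIB
    slowdown = default_gib[inst] / fixed_gib
    print(f"Instance {inst}: {fixed_gib:.2f} GiB/s read, {slowdown:.1f}x slower")
```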

Regards,
Bharata.




