On Thu, Nov 28, 2024 at 5:22 AM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>
> On Thu, Nov 28, 2024 at 5:02 AM Bharata B Rao <bharata@xxxxxxx> wrote:
> >
> > The contention with inode_lock is gone after your above changes. The new
> > top 10 contention data looks like this now:
> >
> >   contended     total wait   max wait   avg wait       type   caller
> >
> >  2441494015       172.15 h    1.72 ms  253.83 us   spinlock   folio_wait_bit_common+0xd5
> >                  0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
> >                  0xffffffffadbf5d01  _raw_spin_lock_irq+0x51
> >                  0xffffffffacdd1905  folio_wait_bit_common+0xd5
> >                  0xffffffffacdd2d0a  filemap_get_pages+0x68a
> >                  0xffffffffacdd2e73  filemap_read+0x103
> >                  0xffffffffad1d67ba  blkdev_read_iter+0x6a
> >                  0xffffffffacf06937  vfs_read+0x297
> >                  0xffffffffacf07653  ksys_read+0x73
> >    25269947         1.58 h    1.72 ms  225.44 us   spinlock   folio_wake_bit+0x62
> >                  0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
> >                  0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
> >                  0xffffffffacdcf322  folio_wake_bit+0x62
> >                  0xffffffffacdd2ca7  filemap_get_pages+0x627
> >                  0xffffffffacdd2e73  filemap_read+0x103
> >                  0xffffffffad1d67ba  blkdev_read_iter+0x6a
> >                  0xffffffffacf06937  vfs_read+0x297
> >                  0xffffffffacf07653  ksys_read+0x73
> >    44757761         1.05 h    1.55 ms   84.41 us   spinlock   folio_wake_bit+0x62
> >                  0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
> >                  0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
> >                  0xffffffffacdcf322  folio_wake_bit+0x62
> >                  0xffffffffacdcf7bc  folio_end_read+0x2c
> >                  0xffffffffacf6d4cf  mpage_read_end_io+0x6f
> >                  0xffffffffad1d8abb  bio_endio+0x12b
> >                  0xffffffffad1f07bd  blk_mq_end_request_batch+0x12d
> >                  0xffffffffc05e4e9b  nvme_pci_complete_batch+0xbb
> [snip]
> > However a point of concern is that FIO bandwidth comes down drastically
> > after the change.
> >
>
> Nicely put :)
>
> >                       default                     inode_lock-fix
> > rw=30%
> > Instance 1    r=55.7GiB/s,w=23.9GiB/s    r=9616MiB/s,w=4121MiB/s
> > Instance 2    r=38.5GiB/s,w=16.5GiB/s    r=8482MiB/s,w=3635MiB/s
> > Instance 3    r=37.5GiB/s,w=16.1GiB/s    r=8609MiB/s,w=3690MiB/s
> > Instance 4    r=37.4GiB/s,w=16.0GiB/s    r=8486MiB/s,w=3637MiB/s
> >
>
> This means that the folio waiting stuff has poor scalability, but
> without digging into it I have no idea what can be done. The easy way
> out would be to speculatively spin before buggering off, but one would
> have to check what happens in real workloads -- presumably the lock
> owner can be off cpu for a long time (I presume there is no way to
> store the owner).
>
> The now-removed lock used rwsems, which behave better when contested,
> and was pulling contention away from folios, artificially *helping*
> performance by having the folio bottleneck be exercised less.
>
> The right thing to do in the long run is still to whack the llseek
> lock acquire, but in light of the above it can probably wait for
> better times.

Willy mentioned the folio wait queue hash table could be grown; you can
find it in mm/filemap.c (around line 1062):

#define PAGE_WAIT_TABLE_BITS 8
#define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

static wait_queue_head_t *folio_waitqueue(struct folio *folio)
{
	return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)];
}

Can you collect off cpu time?

offcputime-bpfcc -K > /tmp/out

On Debian this ships with the bpfcc-tools package.

--
Mateusz Guzik <mjguzik gmail.com>
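
For reference, "growing the table" would amount to bumping PAGE_WAIT_TABLE_BITS
so that folios hash across more wait queue heads (and thus more spinlocks).
A minimal sketch of such an experiment, assuming a plain define bump; the value
10 below is an arbitrary example for measurement, not a number suggested in this
thread:

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1062,2 +1062,3 @@
-#define PAGE_WAIT_TABLE_BITS 8
+/* 10 bits = 1024 buckets instead of 256; example value only, needs measuring. */
+#define PAGE_WAIT_TABLE_BITS 10
 #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)

folio_waitqueue() itself needs no change since it already takes the hash width
from the macro; the cost is the statically allocated folio_wait_table array
doubling in size for each extra bit.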