Re: [RFC PATCH 0/1] Large folios in block buffered IO path

Bharata B Rao <bharata@xxxxxxx> · Wed, 27 Nov 2024 17:48:31 +0530

On 27-Nov-24 11:49 AM, Mateusz Guzik wrote:
On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:

On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao <bharata@xxxxxxx> wrote:

Recently we discussed the scalability issues while running large
instances of FIO with buffered IO option on NVME block devices here:

https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@xxxxxxx/

One of the suggestions Chris Mason gave (during private discussions) was
to enable large folios in block buffered IO path as that could
improve the scalability problems and improve the lock contention
scenarios.


I have no basis to comment on the idea.

However, it is pretty apparent whatever the situation it is being
heavily disfigured by lock contention in blkdev_llseek:

perf-lock contention output
---------------------------
The lock contention data doesn't look all that conclusive but for 30% rwmixwrite
mix it looks like this:

perf-lock contention default
  contended   total wait     max wait     avg wait         type   caller

1337359017     64.69 h     769.04 us    174.14 us     spinlock   rwsem_wake.isra.0+0x42
                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
                         0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
                         0xffffffff8f39e7d2  rwsem_wake.isra.0+0x42
                         0xffffffff8f39e88f  up_write+0x4f
                         0xffffffff8f9d598e  blkdev_llseek+0x4e
                         0xffffffff8f703322  ksys_lseek+0x72
                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
                         0xffffffff8f20b983  x64_sys_call+0x1fb3
    2665573     64.38 h       1.98 s      86.95 ms      rwsem:W   blkdev_llseek+0x31
                         0xffffffff903f15bc  rwsem_down_write_slowpath+0x36c
                         0xffffffff903f18fb  down_write+0x5b
                         0xffffffff8f9d5971  blkdev_llseek+0x31
                         0xffffffff8f703322  ksys_lseek+0x72
                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
                         0xffffffff8f20b983  x64_sys_call+0x1fb3
                         0xffffffff903dce5e  do_syscall_64+0x7e
                         0xffffffff9040012b  entry_SYSCALL_64_after_hwframe+0x76

Admittedly I'm not familiar with this code, but at a quick glance the
lock can be just straight up removed here?

   534 static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
   535 {
   536 │       struct inode *bd_inode = bdev_file_inode(file);
   537 │       loff_t retval;
   538 │
   539 │       inode_lock(bd_inode);
   540 │       retval = fixed_size_llseek(file, offset, whence,
i_size_read(bd_inode));
   541 │       inode_unlock(bd_inode);
   542 │       return retval;
   543 }

At best it stabilizes the size for the duration of the call. Sounds
like it helps nothing since if the size can change, the file offset
will still be altered as if there was no locking?

Suppose this cannot be avoided to grab the size for whatever reason.

While the above fio invocation did not work for me, I ran some crapper
which I had in my shell history and according to strace:
[pid 271829] lseek(7, 0, SEEK_SET)      = 0
[pid 271829] lseek(7, 0, SEEK_SET)      = 0
[pid 271830] lseek(7, 0, SEEK_SET)      = 0

... the lseeks just rewind to the beginning, *definitely* not needing
to know the size. One would have to check but this is most likely the
case in your test as well.

And for that there is 0 need to grab the size, and consequently the inode lock.

Here is the complete FIO cmdline I am using:

fio -filename=/dev/nvme1n1p1 -direct=0 -thread -size=800G -rw=rw 
-rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=64k 
-numjobs=1 -runtime=3600 --time_based -group_reporting -name=mytest

And that results in lseek patterns like these:

lseek(6, 0, SEEK_SET)             = 0
lseek(6, 131072, SEEK_SET)        = 131072
lseek(6, 65536, SEEK_SET)         = 65536
lseek(6, 196608, SEEK_SET)        = 196608
lseek(6, 131072, SEEK_SET)        = 131072
lseek(6, 393216, SEEK_SET)        = 393216
lseek(6, 196608, SEEK_SET)        = 196608
lseek(6, 458752, SEEK_SET)        = 458752
lseek(6, 262144, SEEK_SET)        = 262144
lseek(6, 1114112, SEEK_SET)       = 1114112

The lseeks are interspersed with read and write calls.


That is to say bare minimum this needs to be benchmarked before/after
with the lock removed from the picture, like so:

diff --git a/block/fops.c b/block/fops.c
index 2d01c9007681..7f9e9e2f9081 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -534,12 +534,8 @@ const struct address_space_operations def_blk_aops = {
  static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
  {
         struct inode *bd_inode = bdev_file_inode(file);
-       loff_t retval;

-       inode_lock(bd_inode);
-       retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
-       inode_unlock(bd_inode);
-       return retval;
+       return fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
  }

  static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,

To be aborted if it blows up (but I don't see why it would).

Thanks for this fix, will try and get back with results.

Regards,
Bharata.