On Wed 27-11-24 07:19:59, Mateusz Guzik wrote: > On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote: > > > > On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao <bharata@xxxxxxx> wrote: > > > > > > Recently we discussed the scalability issues while running large > > > instances of FIO with buffered IO option on NVME block devices here: > > > > > > https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@xxxxxxx/ > > > > > > One of the suggestions Chris Mason gave (during private discussions) was > > > to enable large folios in block buffered IO path as that could > > > improve the scalability problems and improve the lock contention > > > scenarios. > > > > > > > I have no basis to comment on the idea. > > > > However, it is pretty apparent whatever the situation it is being > > heavily disfigured by lock contention in blkdev_llseek: > > > > > perf-lock contention output > > > --------------------------- > > > The lock contention data doesn't look all that conclusive but for 30% rwmixwrite > > > mix it looks like this: > > > > > > perf-lock contention default > > > contended total wait max wait avg wait type caller > > > > > > 1337359017 64.69 h 769.04 us 174.14 us spinlock rwsem_wake.isra.0+0x42 > > > 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3 > > > 0xffffffff903f537c _raw_spin_lock_irqsave+0x5c > > > 0xffffffff8f39e7d2 rwsem_wake.isra.0+0x42 > > > 0xffffffff8f39e88f up_write+0x4f > > > 0xffffffff8f9d598e blkdev_llseek+0x4e > > > 0xffffffff8f703322 ksys_lseek+0x72 > > > 0xffffffff8f7033a8 __x64_sys_lseek+0x18 > > > 0xffffffff8f20b983 x64_sys_call+0x1fb3 > > > 2665573 64.38 h 1.98 s 86.95 ms rwsem:W blkdev_llseek+0x31 > > > 0xffffffff903f15bc rwsem_down_write_slowpath+0x36c > > > 0xffffffff903f18fb down_write+0x5b > > > 0xffffffff8f9d5971 blkdev_llseek+0x31 > > > 0xffffffff8f703322 ksys_lseek+0x72 > > > 0xffffffff8f7033a8 __x64_sys_lseek+0x18 > > > 0xffffffff8f20b983 x64_sys_call+0x1fb3 > > > 0xffffffff903dce5e do_syscall_64+0x7e > > > 0xffffffff9040012b entry_SYSCALL_64_after_hwframe+0x76 > > > > Admittedly I'm not familiar with this code, but at a quick glance the > > lock can be just straight up removed here? > > > > 534 static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence) > > 535 { > > 536 │ struct inode *bd_inode = bdev_file_inode(file); > > 537 │ loff_t retval; > > 538 │ > > 539 │ inode_lock(bd_inode); > > 540 │ retval = fixed_size_llseek(file, offset, whence, > > i_size_read(bd_inode)); > > 541 │ inode_unlock(bd_inode); > > 542 │ return retval; > > 543 } > > > > At best it stabilizes the size for the duration of the call. Sounds > > like it helps nothing since if the size can change, the file offset > > will still be altered as if there was no locking? > > > > Suppose this cannot be avoided to grab the size for whatever reason. > > > > While the above fio invocation did not work for me, I ran some crapper > > which I had in my shell history and according to strace: > > [pid 271829] lseek(7, 0, SEEK_SET) = 0 > > [pid 271829] lseek(7, 0, SEEK_SET) = 0 > > [pid 271830] lseek(7, 0, SEEK_SET) = 0 > > > > ... the lseeks just rewind to the beginning, *definitely* not needing > > to know the size. One would have to check but this is most likely the > > case in your test as well. > > > > And for that there is 0 need to grab the size, and consequently the inode lock. > > That is to say bare minimum this needs to be benchmarked before/after > with the lock removed from the picture, like so: Yeah, I've noticed this in the locking profiles as well and I agree bd_inode locking seems unnecessary here. Even some filesystems (e.g. ext4) get away without using inode lock in their llseek handler... Honza > diff --git a/block/fops.c b/block/fops.c > index 2d01c9007681..7f9e9e2f9081 100644 > --- a/block/fops.c > +++ b/block/fops.c > @@ -534,12 +534,8 @@ const struct address_space_operations def_blk_aops = { > static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence) > { > struct inode *bd_inode = bdev_file_inode(file); > - loff_t retval; > > - inode_lock(bd_inode); > - retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode)); > - inode_unlock(bd_inode); > - return retval; > + return fixed_size_llseek(file, offset, whence, i_size_read(bd_inode)); > } > > static int blkdev_fsync(struct file *filp, loff_t start, loff_t end, > > To be aborted if it blows up (but I don't see why it would). > > -- > Mateusz Guzik <mjguzik gmail.com> -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR