On 6/21/23 10:38, Pankaj Raghav wrote:
There has been a lot of discussion recently about supporting devices and filesystems with bs > ps. One of the main pieces of plumbing needed for buffered IO is enforcing a minimum order when allocating folios in the page cache. Hannes recently sent a series[1] that deduces the minimum folio order from i_blkbits in struct inode. Based on the discussion in that thread, this series takes a different approach: the minimum and maximum folio order can be set individually per inode.

This series is based on top of Christoph's patches adding iomap aops for the block cache[2]. I rebased his remaining patches onto next-20230621. The whole tree can be found here[3].

With the tree compiled with CONFIG_BUFFER_HEAD=n, I am able to do buffered IO on an NVMe drive with bs > ps in QEMU without any issues:

[root@archlinux ~]# cat /sys/block/nvme0n2/queue/logical_block_size
16384
[root@archlinux ~]# fio -bs=16k -iodepth=8 -rw=write -ioengine=io_uring -size=500M -name=io_uring_1 -filename=/dev/nvme0n2 -verify=md5
io_uring_1: (g=0): rw=write, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=io_uring, iodepth=8
fio-3.34
Starting 1 process
Jobs: 1 (f=1): [V(1)][100.0%][r=336MiB/s][r=21.5k IOPS][eta 00m:00s]
io_uring_1: (groupid=0, jobs=1): err= 0: pid=285: Wed Jun 21 07:58:29 2023
  read: IOPS=27.3k, BW=426MiB/s (447MB/s)(500MiB/1174msec)
<snip>

Run status group 0 (all jobs):
   READ: bw=426MiB/s (447MB/s), 426MiB/s-426MiB/s (447MB/s-447MB/s), io=500MiB (524MB), run=1174-1174msec
  WRITE: bw=198MiB/s (207MB/s), 198MiB/s-198MiB/s (207MB/s-207MB/s), io=500MiB (524MB), run=2527-2527msec

Disk stats (read/write):
  nvme0n2: ios=35614/4297, merge=0/0, ticks=11283/1441, in_queue=12725, util=96.27%

One of the main dependencies for working on a block device with bs > ps is Christoph's work on converting the block device aops to use iomap.

[1] https://lwn.net/Articles/934651/
[2] https://lwn.net/ml/linux-kernel/20230424054926.26927-1-hch@xxxxxx/
[3] https://github.com/Panky-codes/linux/tree/next-20230523-filemap-order-generic-v1

Luis Chamberlain (1):
  block: set mapping order for the block cache in set_init_blocksize

Matthew Wilcox (Oracle) (1):
  fs: Allow fine-grained control of folio sizes

Pankaj Raghav (2):
  filemap: use minimum order while allocating folios
  nvme: enable logical block size > PAGE_SIZE

 block/bdev.c             |  9 ++++++++
 drivers/nvme/host/core.c |  2 +-
 include/linux/pagemap.h  | 46 ++++++++++++++++++++++++++++++++++++----
 mm/filemap.c             |  9 +++++---
 mm/readahead.c           | 34 ++++++++++++++++++++---------
 5 files changed, 82 insertions(+), 18 deletions(-)
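To make the per-inode order plumbing concrete, here is a minimal user-space sketch of the two ideas the cover letter describes: packing a minimum and maximum folio order into a per-mapping flags word, and deriving the minimum order from the logical block size (a 16384-byte block on 4096-byte pages needs at least order-2 folios). All names here (set_folio_order_range() and friends) and the bit layout are assumptions for illustration, not the actual helpers from the series:

    /*
     * Illustrative sketch of per-mapping folio order limits.
     * The names and the bit layout are assumptions for demonstration,
     * not the kernel API introduced by this series.
     */
    #include <assert.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12 /* assume 4 KiB pages */

    /* Pack min and max folio order into one flags word, 5 bits each. */
    #define FOLIO_ORDER_BITS      5
    #define FOLIO_ORDER_MIN_SHIFT 0
    #define FOLIO_ORDER_MAX_SHIFT FOLIO_ORDER_BITS
    #define FOLIO_ORDER_MASK      ((1u << FOLIO_ORDER_BITS) - 1)

    static unsigned int set_folio_order_range(unsigned int min, unsigned int max)
    {
            assert(min <= max && max <= FOLIO_ORDER_MASK);
            return (min << FOLIO_ORDER_MIN_SHIFT) | (max << FOLIO_ORDER_MAX_SHIFT);
    }

    static unsigned int min_folio_order(unsigned int flags)
    {
            return (flags >> FOLIO_ORDER_MIN_SHIFT) & FOLIO_ORDER_MASK;
    }

    static unsigned int max_folio_order(unsigned int flags)
    {
            return (flags >> FOLIO_ORDER_MAX_SHIFT) & FOLIO_ORDER_MASK;
    }

    /* Smallest order such that one folio covers one logical block. */
    static unsigned int order_for_block_size(unsigned int blkbits)
    {
            return blkbits > PAGE_SHIFT ? blkbits - PAGE_SHIFT : 0;
    }

    int main(void)
    {
            unsigned int blkbits = 14; /* 16384-byte logical blocks */
            unsigned int flags =
                    set_folio_order_range(order_for_block_size(blkbits), 9);

            /* 16 KiB blocks on 4 KiB pages need at least order-2 folios. */
            printf("min order: %u, max order: %u\n",
                   min_folio_order(flags), max_folio_order(flags));
            return 0;
    }

With limits stored this way, the filemap and readahead changes in the series only ever allocate folios within the [min, max] range, so every folio in the mapping covers at least one logical block.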
Hmm. Most unfortunate; I've just finished my own patchset (duplicating much of this work) to get 'brd' running with large folios. And it even works this time; 'fsx' from the xfstests suite runs happily on it.
Guess we'll need to reconcile our patches.

Cheers,
Hannes