On Wed, Feb 12, 2025 at 08:10:04PM +0100, Christoph Hellwig wrote:
> On Fri, Feb 07, 2025 at 11:04:06AM +0100, Daniel Gomez wrote:
> > > > performance in the stx_blksize field of the statx data structure. This
> > > > change updates the current default 4 KiB block size for all devices
> > > > reporting a minimum I/O larger than 4 KiB, opting instead to query for
> > > > its advertised minimum I/O value in the statx data struct.
> > >
> > > UUuh, no. Larger block sizes have their use cases, but this will
> > > regress performance for a lot (most?) common setups. A lot of
> > > device report fairly high values there, but say increasing the
> >
> > Are these devices reporting the correct value?
>
> Who defines what "correct" means to start with?

That's a good question. The stx_blksize field description indicates the
value should refer to the fs block size that avoids RMW.

* From the statx manual:

DESCRIPTION
       struct statx {
               __u32 stx_blksize;    /* Block size for filesystem I/O */
               {...}

   The returned information
       stx_blksize
              The "preferred" block size for efficient filesystem I/O.
              (Writing to a file in smaller chunks may cause an
              inefficient read-modify-rewrite.)

* From include/uapi/linux/stat.h:

struct statx {
	/* Preferred general I/O size [uncond] */
	__u32	stx_blksize;

So I think, if devices report high values in stx_blksize, it is either
because smaller values than the reported one cause RMW, or they are
incorrectly reporting a value in the wrong statx field.

The commit I refer to in the commit message maps the minimum_io_size
reported by the block layer to stx_blksize.

* From Documentation/ABI/stable/sysfs-block:

What:		/sys/block/<disk>/queue/minimum_io_size
Date:		April 2009
Contact:	Martin K. Petersen <martin.petersen@xxxxxxxxxx>
Description:
		[RO] Storage devices may report a granularity or preferred
		minimum I/O size which is the smallest request the device
		can perform without incurring a performance penalty. For
		disk drives this is often the physical block size. For RAID
		arrays it is often the stripe chunk size. A properly
		aligned multiple of minimum_io_size is the preferred
		request size for workloads where a high number of I/O
		operations is desired.

I guess the correct approach is to ensure the mapping only occurs for
"disk drives" and we avoid mapping the RAID stripe chunk size to
stx_blksize.

In addition, we also have the optimal I/O size:

What:		/sys/block/<disk>/queue/optimal_io_size
Date:		April 2009
Contact:	Martin K. Petersen <martin.petersen@xxxxxxxxxx>
Description:
		[RO] Storage devices may report an optimal I/O size, which
		is the device's preferred unit for sustained I/O. This is
		rarely reported for disk drives. For RAID arrays it is
		usually the stripe width or the internal track size. A
		properly aligned multiple of optimal_io_size is the
		preferred request size for workloads where sustained
		throughput is desired. If no optimal I/O size is reported
		this file contains 0.

Optimal I/O is not preferred I/O (minimum_io_size). However, I think the
xfs largeio mount option mixes both values:

* From Documentation/admin-guide/xfs.rst:

Mount Options
=============

  largeio or nolargeio (default)
	If ``nolargeio`` is specified, the optimal I/O reported in
	``st_blksize`` by **stat(2)** will be as small as possible to allow
	user applications to avoid inefficient read/modify/write I/O. This
	is typically the page size of the machine, as this is the
	granularity of the page cache.
But it must be referring to the minimum_io_size, as this is the limit
"to avoid inefficient read/modify/write I/O" per the sysfs-block
minimum_io_size description, right?

> > As I mentioned in my
> > discussion with Darrick, matching the minimum_io_size with the "fs
> > fundamental blocksize" actually allows to avoid RMW operations (when
> > using the default path in mkfs.xfs and the value reported is within
> > boundaries).
>
> At least for buffered I/O it does not remove RMW operations at all,
> it just moves them up.

But only if the RMW boundary is smaller than the fs block size. If they
are matched, it should not move them up.

> And for plenty devices the min_io size might be set to larger than
> LBA size,

Yes, I agree.

> but the device is still reasonably efficient at handling
> smaller I/O (e.g. because it has power loss protection and stages
> writes in powerfail protected memory).

That is device specific and covered by the stx_blksize and
minimum_io_size descriptions.

Assuming min_io is the lsblk MIN-IO field, this is the block layer
minimum_io_size.

* From util-linux code:

	case COL_MINIO:
		ul_path_read_string(dev->sysfs, &str, "queue/minimum_io_size");
		if (rawdata)
			str2u64(str, rawdata);
		break;

I think this might be a bit trickier. The value should report the disk
"preferred I/O minimum granularity" and, for RAID arrays, "the stripe
chunk size".
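
For reference, a minimal userspace sketch (for illustration only, not
part of the patch; the program name and error handling are made up) of
how an application sees the value being discussed via statx(2):

	/* blksize.c - print the preferred I/O size statx reports for a path */
	#define _GNU_SOURCE
	#include <fcntl.h>      /* AT_FDCWD */
	#include <stdio.h>
	#include <sys/stat.h>   /* statx(), struct statx (glibc >= 2.28) */

	int main(int argc, char **argv)
	{
		struct statx stx;

		if (argc != 2) {
			fprintf(stderr, "usage: %s <path>\n", argv[0]);
			return 1;
		}

		/* stx_blksize is in the [uncond] group, so STATX_BASIC_STATS
		 * is enough; no extra mask bits are required. */
		if (statx(AT_FDCWD, argv[1], 0, STATX_BASIC_STATS, &stx) < 0) {
			perror("statx");
			return 1;
		}

		printf("stx_blksize: %u\n", stx.stx_blksize);
		return 0;
	}

Comparing that output against /sys/block/<disk>/queue/minimum_io_size
and optimal_io_size makes it easy to see which of the two block layer
values a given filesystem ends up propagating to stx_blksize.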