On Wed, Feb 12, 2025 at 08:10:04PM +0100, Christoph Hellwig wrote:
> On Fri, Feb 07, 2025 at 11:04:06AM +0100, Daniel Gomez wrote:
> > > > performance in the stx_blksize field of the statx data structure. This
> > > > change updates the current default 4 KiB block size for all devices
> > > > reporting a minimum I/O larger than 4 KiB, opting instead to query for
> > > > its advertised minimum I/O value in the statx data struct.
> > >
> > > UUuh, no. Larger block sizes have their use cases, but this will
> > > regress performance for a lot (most?) common setups. A lot of
> > > device report fairly high values there, but say increasing the
> >
> > Are these devices reporting the correct value?
>
> Who defines what "correct" means to start with?

That's a good question. The stx_blksize field description indicates the
value should refer to the fs block size that avoids RMW.

* From the statx manual:

DESCRIPTION
       struct statx {
               __u32 stx_blksize;    /* Block size for filesystem I/O */
               {...}

   The returned information
       stx_blksize
              The "preferred" block size for efficient filesystem I/O.
              (Writing to a file in smaller chunks may cause an
              inefficient read-modify-rewrite.)

* From include/uapi/linux/stat.h:

struct statx {
	/* Preferred general I/O size [uncond] */
	__u32	stx_blksize;

So I think, if devices report high values in stx_blksize, it is either
because smaller values than the reported one cause RMW, or they are
incorrectly reporting a value in the wrong statx field.

The commit I refer to in the commit message maps the minimum_io_size
reported by the block layer to stx_blksize.

* From Documentation/ABI/stable/sysfs-block:

What:		/sys/block/<disk>/queue/minimum_io_size
Date:		April 2009
Contact:	Martin K. Petersen <martin.petersen@xxxxxxxxxx>
Description:
		[RO] Storage devices may report a granularity or preferred
		minimum I/O size which is the smallest request the device
		can perform without incurring a performance penalty. For
		disk drives this is often the physical block size. For RAID
		arrays it is often the stripe chunk size. A properly
		aligned multiple of minimum_io_size is the preferred
		request size for workloads where a high number of I/O
		operations is desired.

I guess the correct approach is to ensure the mapping only occurs for
"disk drives" and we avoid mapping the RAID stripe chunk size to
stx_blksize.

In addition, we also have the optimal I/O size:

What:		/sys/block/<disk>/queue/optimal_io_size
Date:		April 2009
Contact:	Martin K. Petersen <martin.petersen@xxxxxxxxxx>
Description:
		[RO] Storage devices may report an optimal I/O size, which
		is the device's preferred unit for sustained I/O. This is
		rarely reported for disk drives. For RAID arrays it is
		usually the stripe width or the internal track size. A
		properly aligned multiple of optimal_io_size is the
		preferred request size for workloads where sustained
		throughput is desired. If no optimal I/O size is reported
		this file contains 0.

Optimal I/O is not preferred I/O (minimum_io_size). However, I think the
xfs largeio mount option mixes both values:

* From Documentation/admin-guide/xfs.rst:

Mount Options
=============

  largeio or nolargeio (default)
	If ``nolargeio`` is specified, the optimal I/O reported in
	``st_blksize`` by **stat(2)** will be as small as possible to allow
	user applications to avoid inefficient read/modify/write I/O. This
	is typically the page size of the machine, as this is the
	granularity of the page cache.
But it must be referring to the minimum_io_size, as this is the limit
"to avoid inefficient read/modify/write I/O" per the sysfs-block
minimum_io_size description, right?

> > As I mentioned in my
> > discussion with Darrick, matching the minimum_io_size with the "fs
> > fundamental blocksize" actually allows to avoid RMW operations (when
> > using the default path in mkfs.xfs and the value reported is within
> > boundaries).
>
> At least for buffered I/O it does not remove RMW operations at all,
> it just moves them up.

But only if the RMW boundary is smaller than the fs block size. If they
are matched, it should not move them up.

> And for plenty devices the min_io size might be set to larger than
> LBA size,

Yes, I agree.

> but the device is still reasonably efficient at handling
> smaller I/O (e.g. because it has power loss protection and stages
> writes in powerfail protected memory).

That is device specific and covered by the stx_blksize and
minimum_io_size descriptions.

Assuming min_io is the lsblk MIN-IO field, this is the block layer
minimum_io_size.

* From util-linux code:

	case COL_MINIO:
		ul_path_read_string(dev->sysfs, &str, "queue/minimum_io_size");
		if (rawdata)
			str2u64(str, rawdata);
		break;

I think this might be a bit trickier. The value should report the disk
"preferred I/O minimum granularity" and, for RAID arrays, "the stripe
chunk size".
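
For reference, a minimal userspace sketch (for illustration only, not
part of the patch; the program name and error handling are made up) of
how an application sees the value being discussed via statx(2):

	/* blksize.c - print the preferred I/O size statx reports for a path */
	#define _GNU_SOURCE
	#include <fcntl.h>      /* AT_FDCWD */
	#include <stdio.h>
	#include <sys/stat.h>   /* statx(), struct statx (glibc >= 2.28) */

	int main(int argc, char **argv)
	{
		struct statx stx;

		if (argc != 2) {
			fprintf(stderr, "usage: %s <path>\n", argv[0]);
			return 1;
		}

		/* stx_blksize is in the [uncond] group, so STATX_BASIC_STATS
		 * is enough; no extra mask bits are required. */
		if (statx(AT_FDCWD, argv[1], 0, STATX_BASIC_STATS, &stx) < 0) {
			perror("statx");
			return 1;
		}

		printf("stx_blksize: %u\n", stx.stx_blksize);
		return 0;
	}

Comparing that output against /sys/block/<disk>/queue/minimum_io_size
and optimal_io_size makes it easy to see which of the two block layer
values a given filesystem ends up propagating to stx_blksize.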