Re: [PATCH v4 05/11] block: Add core atomic write support

John Garry <john.g.garry@xxxxxxxxxx> · Mon, 26 Feb 2024 09:23:35 +0000

On 25/02/2024 12:09, Ritesh Harjani (IBM) wrote:
John Garry <john.g.garry@xxxxxxxxxx> writes:

Add atomic write support as follows:
- report request_queue atomic write support limits to sysfs and update Doc
- add helper functions to get request_queue atomic write limits
- support to safely merge atomic writes
- add a per-request atomic write flag
- deal with splitting atomic writes
- misc helper functions

New sysfs files are added to report the following atomic write limits:
- atomic_write_boundary_bytes
- atomic_write_max_bytes
- atomic_write_unit_max_bytes
- atomic_write_unit_min_bytes

atomic_write_unit_{min,max}_bytes report the min and max atomic write
support size, inclusive, and are primarily dictated by HW capability. Both
values must be a power-of-2. atomic_write_boundary_bytes, if non-zero,
indicates an LBA space boundary; an atomic write which straddles it is no
longer atomically executed by the disk. atomic_write_max_bytes is the
maximum merged size for an atomic write. Often it will be the same value as
atomic_write_unit_max_bytes.
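
To make the unit_min/unit_max rule above concrete, here is a minimal userspace sketch of a length check. The helper names are hypothetical and not part of the patch, and the requirement that the write length itself is a power-of-2 is an assumption on top of what the paragraph states (it only says the two limits are powers of 2 and inclusive).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if v is a non-zero power of 2. */
static bool is_pow2(uint64_t v)
{
	return v && !(v & (v - 1));
}

/*
 * Hypothetical check of an atomic write length (in bytes) against the
 * reported atomic_write_unit_{min,max}_bytes. Both limits are inclusive.
 * Assumption: the length itself must also be a power of 2.
 */
static bool atomic_write_len_ok(uint64_t len, uint64_t unit_min,
				uint64_t unit_max)
{
	/* A zero limit is treated as "atomic writes not supported". */
	if (!unit_min || !unit_max)
		return false;

	if (!is_pow2(len))
		return false;

	return len >= unit_min && len <= unit_max;
}
```

For example, with unit_min = 4096 and unit_max = 65536, lengths of 4096 and 65536 pass, while 2048 (too small) and 12288 (not a power of 2) are rejected.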

Instead of explaining sysfs outputs which are derivatives of HW
and request_queue limits (and also defined in Documentation), maybe we
could explain how those sysfs values are derived instead -

struct queue_limits {
<...>
	unsigned int		atomic_write_hw_max_sectors;
	unsigned int		atomic_write_max_sectors;
	unsigned int		atomic_write_hw_boundary_sectors;
	unsigned int		atomic_write_hw_unit_min_sectors;
	unsigned int		atomic_write_unit_min_sectors;
	unsigned int		atomic_write_hw_unit_max_sectors;
	unsigned int		atomic_write_unit_max_sectors;
<...>

1. atomic_write_hw_max_sectors comes directly from hw and it need
not be a power of 2.

2. atomic_write_hw_unit_min_sectors and atomic_write_hw_unit_max_sectors
are again defined/derived from hw limits, but they are rounded down so that
they are always a power of 2.

3. atomic_write_hw_boundary_sectors again comes from HW boundary limit.
It could either be 0 (which means the device specifies no boundary limit) or a
multiple of unit_max. It need not be a power of 2; however, the current
code assumes it to be a power of 2 (check callers of blk_queue_atomic_write_boundary_bytes()).

4. atomic_write_max_sectors, atomic_write_unit_min_sectors
and atomic_write_unit_max_sectors are all derived out of above hw limits
inside function blk_atomic_writes_update_limits() based on request_queue
limits.
     a. atomic_write_max_sectors is derived from atomic_write_hw_unit_max_sectors and
        request_queue's max_hw_sectors limit. It also guarantees max
        sectors that can be fit in a single bio.
     b. atomic_write_unit_[min|max]_sectors are derived from atomic_write_hw_unit_[min|max]_sectors,
        request_queue's max_hw_sectors & blk_queue_max_guaranteed_bio_sectors(). Both of these limits
        are kept as a power of 2.
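
A rough userspace model of point 4b, as I read the description above. This is a sketch and not the code from the patch: struct limits_model, rounddown_pow2() and update_unit_limits() are hypothetical names, and max_guaranteed_bio_sectors is just a plain field standing in for the result of blk_queue_max_guaranteed_bio_sectors(). It also assumes the hw unit values are already powers of 2, as in point 2.

```c
#include <stdint.h>

/* Hypothetical model of the relevant queue_limits members. */
struct limits_model {
	uint32_t max_hw_sectors;
	uint32_t max_guaranteed_bio_sectors;	/* stand-in value */
	uint32_t atomic_write_hw_unit_min_sectors;
	uint32_t atomic_write_hw_unit_max_sectors;
	uint32_t atomic_write_unit_min_sectors;
	uint32_t atomic_write_unit_max_sectors;
};

/* Largest power of 2 that is <= v, or 0 when v is 0. */
static uint32_t rounddown_pow2(uint32_t v)
{
	uint32_t p = 1;

	if (!v)
		return 0;
	while (p <= v / 2)
		p <<= 1;
	return p;
}

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/*
 * Cap the hw unit limits by a power-of-2 limit built from max_hw_sectors
 * and the guaranteed single-bio size, so both results stay a power of 2.
 */
static void update_unit_limits(struct limits_model *l)
{
	uint32_t unit_limit =
		min_u32(rounddown_pow2(l->max_hw_sectors),
			rounddown_pow2(l->max_guaranteed_bio_sectors));

	l->atomic_write_unit_min_sectors =
		min_u32(l->atomic_write_hw_unit_min_sectors, unit_limit);
	l->atomic_write_unit_max_sectors =
		min_u32(l->atomic_write_hw_unit_max_sectors, unit_limit);
}
```

With max_hw_sectors = 2560, a guaranteed bio size of 300 sectors, and hw unit min/max of 8/1024 sectors, the cap works out to 256 sectors, so the resulting unit min/max would be 8/256.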

Now coming to sysfs outputs -
1. atomic_write_unit_max_bytes: Same as atomic_write_unit_max_sectors in bytes
2. atomic_write_unit_min_bytes: Same as atomic_write_unit_min_sectors in bytes
3. atomic_write_boundary_bytes: same as atomic_write_hw_boundary_sectors
in bytes
4. atomic_write_max_bytes: Same as atomic_write_max_sectors in bytes
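
The "in bytes" conversion in the list above is just a shift by SECTOR_SHIFT (512-byte sectors). A tiny sketch with a hypothetical helper name, widening first so that large sector counts do not overflow 32 bits:

```c
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, as in the kernel */

/* Hypothetical helper: convert a sector count to bytes without overflow. */
static uint64_t sectors_to_bytes(uint32_t sectors)
{
	return (uint64_t)sectors << SECTOR_SHIFT;
}
```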


ok, I can look to incorporate the advised formatting changes


atomic_write_unit_max_bytes is capped at the maximum data size which we are
guaranteed to be able to fit in a BIO, as an atomic write must always be
submitted as a single BIO. This BIO max size is dictated by the number of

Here it says that the atomic write must always be submitted as a single
bio. From where to where?

submitted to the block layer/core

I think you meant from FS to block layer.

sure, or also block device file operations (in fops.c) to block core

Because otherwise we still allow request/bio merging inside block layer
based on the request queue limits we defined above. i.e. bio can be
chained to form
       rq->biotail->bi_next = next_rq->bio
as long as the merged request is within the queue_limits.

i.e. atomic write requests can be merged as long as -
     - both rqs have REQ_ATOMIC set
     - blk_rq_sectors(final_rq) <= q->limits.atomic_write_max_sectors
     - final rq formed should not straddle limits->atomic_write_hw_boundary_sectors

However, splitting of an atomic write request is not allowed. And if it
happens, we fail the I/O req & return -EINVAL.
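
The three merge conditions listed above could be modelled in userspace roughly as follows. This is only a sketch: struct rq_model and both functions are hypothetical, the boundary is assumed to be a power of 2 (or 0 for "no boundary"), and the contiguity check is my assumption for a simple back merge rather than something taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the parts of struct request we care about. */
struct rq_model {
	uint64_t pos;		/* start sector */
	uint32_t sectors;	/* length in sectors */
	bool atomic;		/* models REQ_ATOMIC being set */
};

/* True if [start, start + len) crosses a power-of-2 boundary (sectors). */
static bool straddles(uint64_t start, uint64_t len, uint64_t boundary)
{
	uint64_t mask;

	if (!boundary || !len)
		return false;
	mask = boundary - 1;
	/* Compare the boundary "window" of the first and last sector. */
	return (start | mask) != ((start + len - 1) | mask);
}

static bool can_merge_atomic(const struct rq_model *rq,
			     const struct rq_model *next,
			     uint32_t atomic_write_max_sectors,
			     uint32_t boundary_sectors)
{
	uint64_t total;

	/* Both requests must be atomic writes. */
	if (!rq->atomic || !next->atomic)
		return false;

	/* Assumption: only a contiguous back merge is considered here. */
	if (rq->pos + rq->sectors != next->pos)
		return false;

	/* The merged request must fit within atomic_write_max_sectors. */
	total = (uint64_t)rq->sectors + next->sectors;
	if (total > atomic_write_max_sectors)
		return false;

	/* The merged request must not straddle the atomic write boundary. */
	return !straddles(rq->pos, total, boundary_sectors);
}
```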

...


IMHO, the commit message can definitely use a re-write. I agree that you
have put in a lot of information, but I think it can be more organized.

ok, fine. I'll look at this. Thanks.



Contains significant contributions from:
Himanshu Madhani <himanshu.madhani@xxxxxxxxxx>

Maybe it can use a better tag then.
"Documentation/process/submitting-patches.rst"

ok



Signed-off-by: John Garry <john.g.garry@xxxxxxxxxx>
---
  Documentation/ABI/stable/sysfs-block |  52 ++++++++++++++
  block/blk-merge.c                    |  91 ++++++++++++++++++++++-
  block/blk-settings.c                 | 103 +++++++++++++++++++++++++++
  block/blk-sysfs.c                    |  33 +++++++++
  block/blk.h                          |   3 +
  include/linux/blk_types.h            |   2 +
  include/linux/blkdev.h               |  60 ++++++++++++++++
  7 files changed, 343 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 1fe9a553c37b..4c775f4bdefe 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -21,6 +21,58 @@ Description:
  		device is offset from the internal allocation unit's
  		natural alignment.

...

  

/* A comment explaining this function and arguments could be helpful */

already addressed according to earlier review


+static bool rq_straddles_atomic_write_boundary(struct request *rq,
+					unsigned int front,
+					unsigned int back)

Perhaps a better naming would be start_adjust, end_adjust?

ok


+{
+	unsigned int boundary = queue_atomic_write_boundary_bytes(rq->q);
+	unsigned int mask, imask;
+	loff_t start, end;

start_rq_pos, end_rq_pos maybe?

ok


+
+	if (!boundary)
+		return false;
+
+	start = rq->__sector << SECTOR_SHIFT;

blk_rq_pos(rq) perhaps?

ok


+	end = start + rq->__data_len;

blk_rq_bytes(rq) perhaps? It should be..

ok

+
+	start -= front;
+	end += back;
+
+	/* We're longer than the boundary, so must be crossing it */
+	if (end - start > boundary)
+		return true;
+
+	mask = boundary - 1;
+
+	/* start/end are boundary-aligned, so cannot be crossing */
+	if (!(start & mask) || !(end & mask))
+		return false;
+
+	imask = ~mask;
+
+	/* Top bits are different, so crossed a boundary */
+	if ((start & imask) != (end & imask))
+		return true;

The last condition looks wrong. Shouldn't it be end - 1?

+
+	return false;
+}

Can we do something like this?

static bool rq_straddles_atomic_write_boundary(struct request *rq,
					       unsigned int start_adjust,
					       unsigned int end_adjust)
{
	unsigned int boundary = queue_atomic_write_boundary_bytes(rq->q);
	unsigned long boundary_mask;
	unsigned long start_rq_pos, end_rq_pos;

	if (!boundary)
		return false;

	start_rq_pos = blk_rq_pos(rq) << SECTOR_SHIFT;
	end_rq_pos = start_rq_pos + blk_rq_bytes(rq);

	start_rq_pos -= start_adjust;
	end_rq_pos += end_adjust;

	boundary_mask = boundary - 1;

	if ((start_rq_pos | boundary_mask) != (end_rq_pos | boundary_mask))
		return true;

	return false;
}

I was thinking this check should cover all cases? Thoughts?

that looks ok (apart from issue already detected later). It is quite 
similar to how I coded it in the NVMe driver, apart from the initial > 
boundary check.
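
For reference, a standalone userspace sketch of the mask-based check with the "end - 1" adjustment raised above folded in. The function name and the plain-integer arguments are hypothetical, and it assumes the boundary is a power of 2, as the current code does. Comparing against the exclusive end instead of the last byte would flag a write which ends exactly on a boundary (e.g. start 0, length 4096, boundary 4096) as straddling.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: does the byte range [start, start + len) cross a
 * boundary? Assumes boundary is 0 (no boundary) or a power of 2.
 */
static bool straddles_atomic_write_boundary(uint64_t start, uint64_t len,
					    uint64_t boundary)
{
	uint64_t mask, last;

	if (!boundary || !len)
		return false;

	mask = boundary - 1;
	last = start + len - 1;	/* last byte written, i.e. end - 1 */

	/* Different boundary windows for first and last byte: crossed. */
	return (start | mask) != (last | mask);
}
```

This also covers the "longer than the boundary" case without a separate check, since any range longer than the boundary must have its first and last byte in different windows.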

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index f288c94374b3..cd7cceb8565d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -422,6 +422,7 @@ enum req_flag_bits {
  	__REQ_DRV,		/* for driver use */
  	__REQ_FS_PRIVATE,	/* for file system (submitter) use */

+	__REQ_ATOMIC,		/* for atomic write operations */
  	/*
  	 * Command specific flags, keep last:
  	 */
@@ -448,6 +449,7 @@ enum req_flag_bits {
  #define REQ_RAHEAD	(__force blk_opf_t)(1ULL << __REQ_RAHEAD)
  #define REQ_BACKGROUND	(__force blk_opf_t)(1ULL << __REQ_BACKGROUND)
  #define REQ_NOWAIT	(__force blk_opf_t)(1ULL << __REQ_NOWAIT)
+#define REQ_ATOMIC	(__force blk_opf_t)(1ULL << __REQ_ATOMIC)

Let's add this in the same order as __REQ_ATOMIC, i.e. after the
REQ_FS_PRIVATE macro

ok, fine

>> @@ -299,6 +299,14 @@ struct queue_limits {
>>   	unsigned int		discard_alignment;
>>   	unsigned int		zone_write_granularity;
>>
>> +	unsigned int		atomic_write_hw_max_sectors;
>> +	unsigned int		atomic_write_max_sectors;
>> +	unsigned int		atomic_write_hw_boundary_sectors;
>> +	unsigned int		atomic_write_hw_unit_min_sectors;
>> +	unsigned int		atomic_write_unit_min_sectors;
>> +	unsigned int		atomic_write_hw_unit_max_sectors;
>> +	unsigned int		atomic_write_unit_max_sectors;
>> +
> 1 liner comment for above members please?

ok


+static inline bool bdev_can_atomic_write(struct block_device *bdev)
+{
+	struct request_queue *bd_queue = bdev->bd_queue;
+	struct queue_limits *limits = &bd_queue->limits;
+
+	if (!limits->atomic_write_unit_min_sectors)
+		return false;
+
+	if (bdev_is_partition(bdev)) {
+		sector_t bd_start_sect = bdev->bd_start_sect;
+		unsigned int granularity = max(

atomic_align perhaps?

or just "align"


+				limits->atomic_write_unit_min_sectors,
+				limits->atomic_write_hw_boundary_sectors);
+		if (do_div(bd_start_sect, granularity))
+			return false;
+	}

Since atomic_align is a power of 2, why not use IS_ALIGNED()
(bitwise operation instead of div)?

already changed as advised
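
As a userspace illustration of the bitwise variant (not the code in the next version): IS_ALIGNED_POW2() below is a local stand-in for the kernel's IS_ALIGNED(), and partition_start_ok() is a hypothetical name which models only the partition part of bdev_can_atomic_write(). It relies on both limits being powers of 2 (or 0 for no boundary).

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for IS_ALIGNED(); only valid when a is a power of 2. */
#define IS_ALIGNED_POW2(x, a)	(((x) & ((a) - 1)) == 0)

/*
 * Hypothetical model of the partition check: the partition start sector
 * must be aligned to max(unit_min, boundary), both in sectors.
 */
static bool partition_start_ok(uint64_t start_sect, uint32_t unit_min_sectors,
			       uint32_t boundary_sectors)
{
	uint64_t align = unit_min_sectors > boundary_sectors ?
			 unit_min_sectors : boundary_sectors;

	/* unit_min of 0 means atomic writes are not supported at all. */
	if (!unit_min_sectors)
		return false;

	return IS_ALIGNED_POW2(start_sect, align);
}
```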

Thanks,
John