Re: [PATCH 4/6] fs: xfs: Support atomic write for statx

John Garry <john.g.garry@xxxxxxxxxx> · Fri, 9 Feb 2024 17:30:50 +0000

+void xfs_get_atomic_write_attr(
+	struct xfs_inode *ip,
+	unsigned int *unit_min,
+	unsigned int *unit_max)
+{
+	xfs_extlen_t		extsz = xfs_get_extsz(ip);
+	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
+	struct block_device	*bdev = target->bt_bdev;
+	unsigned int		awu_min, awu_max, align;
+	struct request_queue	*q = bdev->bd_queue;
+	struct xfs_mount	*mp = ip->i_mount;
+
+	/*
+	 * Convert to multiples of the BLOCKSIZE (as we support a minimum
+	 * atomic write unit of BLOCKSIZE).
+	 */
+	awu_min = queue_atomic_write_unit_min_bytes(q);
+	awu_max = queue_atomic_write_unit_max_bytes(q);
+
+	awu_min &= ~mp->m_blockmask;
+	awu_max &= ~mp->m_blockmask;

I don't understand why we try to round down the awu_max to blocks size
here and not just have an explicit check of (awu_max < blocksize).
We have later check for !awu_max:

if (!awu_max || !xfs_inode_atomicwrites(ip) || !align ||
...

So what we are doing is ensuring that the awu_max which the device 
reports is at least FS block size. If it is not, then we cannot support 
atomic writes.

Indeed, my NVMe drive reports awu_max  = 4K. So for XFS configured for 
64K block size, we will say that we don't support atomic writes.

I think the issue with changing the awu_max is that we are using awu_max
to also indirectly reflect the alignment so as to ensure we don't cross
atomic boundaries set by the hw (eg we check uint_max % atomic alignment
== 0 in scsi). So once we change the awu_max, there's a chance that even
if an atomic write aligns to the new awu_max it still doesn't have the
right alignment and fails.

All these values should be powers-of-2, so rounding down should not 
affect whether we would not cross an atomic write boundary.

It works right now since eveything is power of 2 but it should cause
issues incase we decide to remove that limitation. 

Sure, but that is a fundamental principle of this atomic write support. 
Not having powers-of-2 requirement for atomic writes will affect many 
things.

Anyways, I think
this implicit behavior of things working since eveything is a power of 2
should atleast be documented in a comment, so these things are
immediately clear.

+
+	align = XFS_FSB_TO_B(mp, extsz);
+
+	if (!awu_max || !xfs_inode_atomicwrites(ip) || !align ||
+	    !is_power_of_2(align)) {

Correct me if I'm wrong but here as well, the is_power_of_2(align) is
esentially checking if the align % uinit_max == 0 (or vice versa if
unit_max is greater) 

yes

so that an allocation of extsize will always align
nicely as needed by the device.

I'm trying to keep things simple now.

In theory we could allow, say, align == 12 FSB, and then could say 
awu_max = 4.

The same goes for atomic write boundary in NVMe. Currently we say that 
it needs to be a power-of-2. However, it really just needs to be a 
multiple of awu_max. So if some HW did report a !power-of-2 atomic write 
boundary, we could reduce awu_max reported until to fits the power-of-2 
rule and also is cleanly divisible into atomic write boundary. But that 
is just not what HW will report (I expect). We live in a power-of-2 data 
granularity world.

So maybe we should use the % expression explicitly so that the intention
is immediately clear.

As mentioned, I wanted to keep it simple. In addition, it's a bit of a 
mess for the FS block allocator to work with odd sizes, like 12. And it 
does not suit RAID stripe alignment, which works in powers-of-2.

+		*unit_min = 0;
+		*unit_max = 0;
+	} else {
+		if (awu_min)
+			*unit_min = min(awu_min, align);

How will the min() here work? If awu_min is the minumum set by the
device, how can statx be allowed to advertise something smaller than
that?
The idea is that is that if the awu_min reported by the device is less 
than the FS block size, then we report awu_min = FS block size. We 
already know that awu_max => FS block size, since we got this far, so 
saying that awu_min = FS block size is ok.

Otherwise it is the minimum of alignment and awu_min. I suppose that 
does not make much sense, and we should just always require awu_min = FS 
block size.

If I understand correctly, right now the way we set awu_min in scsi and
nvme, the follwoing should usually be true for a sane device:

  awu_min <= blocks size of fs <= align

  so the min() anyways becomes redundant, but if we do assume that there
  might be some weird devices with awu_min absurdly large (SCSI with
  high atomic granularity) we still can't actually advertise a min
  smaller than that of the device, or am I missing something here?

As above, I might just ensure that we can do awu_min = FS block size and 
not deal with crazy devices.

Thanks,
John