+void xfs_get_atomic_write_attr(
+ struct xfs_inode *ip,
+ unsigned int *unit_min,
+ unsigned int *unit_max)
+{
+ xfs_extlen_t extsz = xfs_get_extsz(ip);
+ struct xfs_buftarg *target = xfs_inode_buftarg(ip);
+ struct block_device *bdev = target->bt_bdev;
+ unsigned int awu_min, awu_max, align;
+ struct request_queue *q = bdev->bd_queue;
+ struct xfs_mount *mp = ip->i_mount;
+
+ /*
+ * Convert to multiples of the BLOCKSIZE (as we support a minimum
+ * atomic write unit of BLOCKSIZE).
+ */
+ awu_min = queue_atomic_write_unit_min_bytes(q);
+ awu_max = queue_atomic_write_unit_max_bytes(q);
+
+ awu_min &= ~mp->m_blockmask;
+ awu_max &= ~mp->m_blockmask;
I don't understand why we try to round down the awu_max to blocks size
here and not just have an explicit check of (awu_max < blocksize).
We have later check for !awu_max:
if (!awu_max || !xfs_inode_atomicwrites(ip) || !align ||
...
So what we are doing is ensuring that the awu_max which the device
reports is at least FS block size. If it is not, then we cannot support
atomic writes.
Indeed, my NVMe drive reports awu_max = 4K. So for XFS configured for
64K block size, we will say that we don't support atomic writes.
I think the issue with changing the awu_max is that we are using awu_max
to also indirectly reflect the alignment so as to ensure we don't cross
atomic boundaries set by the hw (eg we check uint_max % atomic alignment
== 0 in scsi). So once we change the awu_max, there's a chance that even
if an atomic write aligns to the new awu_max it still doesn't have the
right alignment and fails.
All these values should be powers-of-2, so rounding down should not
affect whether we would not cross an atomic write boundary.
It works right now since eveything is power of 2 but it should cause
issues incase we decide to remove that limitation.
Sure, but that is a fundamental principle of this atomic write support.
Not having powers-of-2 requirement for atomic writes will affect many
things.
Anyways, I think
this implicit behavior of things working since eveything is a power of 2
should atleast be documented in a comment, so these things are
immediately clear.
+
+ align = XFS_FSB_TO_B(mp, extsz);
+
+ if (!awu_max || !xfs_inode_atomicwrites(ip) || !align ||
+ !is_power_of_2(align)) {
Correct me if I'm wrong but here as well, the is_power_of_2(align) is
esentially checking if the align % uinit_max == 0 (or vice versa if
unit_max is greater)
yes
so that an allocation of extsize will always align
nicely as needed by the device.
I'm trying to keep things simple now.
In theory we could allow, say, align == 12 FSB, and then could say
awu_max = 4.
The same goes for atomic write boundary in NVMe. Currently we say that
it needs to be a power-of-2. However, it really just needs to be a
multiple of awu_max. So if some HW did report a !power-of-2 atomic write
boundary, we could reduce awu_max reported until to fits the power-of-2
rule and also is cleanly divisible into atomic write boundary. But that
is just not what HW will report (I expect). We live in a power-of-2 data
granularity world.
So maybe we should use the % expression explicitly so that the intention
is immediately clear.
As mentioned, I wanted to keep it simple. In addition, it's a bit of a
mess for the FS block allocator to work with odd sizes, like 12. And it
does not suit RAID stripe alignment, which works in powers-of-2.
+ *unit_min = 0;
+ *unit_max = 0;
+ } else {
+ if (awu_min)
+ *unit_min = min(awu_min, align);
How will the min() here work? If awu_min is the minumum set by the
device, how can statx be allowed to advertise something smaller than
that?
The idea is that is that if the awu_min reported by the device is less
than the FS block size, then we report awu_min = FS block size. We
already know that awu_max => FS block size, since we got this far, so
saying that awu_min = FS block size is ok.
Otherwise it is the minimum of alignment and awu_min. I suppose that
does not make much sense, and we should just always require awu_min = FS
block size.
If I understand correctly, right now the way we set awu_min in scsi and
nvme, the follwoing should usually be true for a sane device:
awu_min <= blocks size of fs <= align
so the min() anyways becomes redundant, but if we do assume that there
might be some weird devices with awu_min absurdly large (SCSI with
high atomic granularity) we still can't actually advertise a min
smaller than that of the device, or am I missing something here?
As above, I might just ensure that we can do awu_min = FS block size and
not deal with crazy devices.
Thanks,
John