Re: [PATCH 4/6] fs: xfs: Support atomic write for statx

Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx> · Mon, 12 Feb 2024 17:18:17 +0530

On Fri, Feb 09, 2024 at 05:30:50PM +0000, John Garry wrote:
> 
> > > +void xfs_get_atomic_write_attr(
> > > +	struct xfs_inode *ip,
> > > +	unsigned int *unit_min,
> > > +	unsigned int *unit_max)
> > > +{
> > > +	xfs_extlen_t		extsz = xfs_get_extsz(ip);
> > > +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> > > +	struct block_device	*bdev = target->bt_bdev;
> > > +	unsigned int		awu_min, awu_max, align;
> > > +	struct request_queue	*q = bdev->bd_queue;
> > > +	struct xfs_mount	*mp = ip->i_mount;
> > > +
> > > +	/*
> > > +	 * Convert to multiples of the BLOCKSIZE (as we support a minimum
> > > +	 * atomic write unit of BLOCKSIZE).
> > > +	 */
> > > +	awu_min = queue_atomic_write_unit_min_bytes(q);
> > > +	awu_max = queue_atomic_write_unit_max_bytes(q);
> > > +
> > > +	awu_min &= ~mp->m_blockmask;
> > > +	awu_max &= ~mp->m_blockmask;
> > 
> > I don't understand why we try to round down the awu_max to blocks size
> > here and not just have an explicit check of (awu_max < blocksize).
> We have later check for !awu_max:
> 
> if (!awu_max || !xfs_inode_atomicwrites(ip) || !align ||
> ...
> 
> So what we are doing is ensuring that the awu_max which the device reports
> is at least FS block size. If it is not, then we cannot support atomic
> writes.
> 
> Indeed, my NVMe drive reports awu_max  = 4K. So for XFS configured for 64K
> block size, we will say that we don't support atomic writes.

> 
> > 
> > I think the issue with changing the awu_max is that we are using awu_max
> > to also indirectly reflect the alignment so as to ensure we don't cross
> > atomic boundaries set by the hw (eg we check uint_max % atomic alignment
> > == 0 in scsi). So once we change the awu_max, there's a chance that even
> > if an atomic write aligns to the new awu_max it still doesn't have the
> > right alignment and fails.
> 
> All these values should be powers-of-2, so rounding down should not affect
> whether we would not cross an atomic write boundary.

> 
> > 
> > It works right now since eveything is power of 2 but it should cause
> > issues incase we decide to remove that limitation.
> 
> Sure, but that is a fundamental principle of this atomic write support. Not
> having powers-of-2 requirement for atomic writes will affect many things.
> 

Correct, so the only reason for the rounding down is to ensure that
awu_max is not smaller than our block size but the way these checks are
right now doesn't make it obvious. It also raises questions like why we
are changing these min and max values especially why are we rounding
*down* min.

I think we should just have explicit (unit_[min/max] < bs) checks
without trying to round down the values.

> > Anyways, I think
> > this implicit behavior of things working since eveything is a power of 2
> > should atleast be documented in a comment, so these things are
> > immediately clear.
> > 
> > > +
> > > +	align = XFS_FSB_TO_B(mp, extsz);
> > > +
> > > +	if (!awu_max || !xfs_inode_atomicwrites(ip) || !align ||
> > > +	    !is_power_of_2(align)) {
> > 
> > Correct me if I'm wrong but here as well, the is_power_of_2(align) is
> > esentially checking if the align % uinit_max == 0 (or vice versa if
> > unit_max is greater)
> 
> yes
> 
> > so that an allocation of extsize will always align
> > nicely as needed by the device.
> > 
> 
> I'm trying to keep things simple now.
> 
> In theory we could allow, say, align == 12 FSB, and then could say awu_max =
> 4.
> 
> The same goes for atomic write boundary in NVMe. Currently we say that it
> needs to be a power-of-2. However, it really just needs to be a multiple of
> awu_max. So if some HW did report a !power-of-2 atomic write boundary, we
> could reduce awu_max reported until to fits the power-of-2 rule and also is
> cleanly divisible into atomic write boundary. But that is just not what HW
> will report (I expect). We live in a power-of-2 data granularity world.

True, we ideally won't expect the hw to report that but why not just
make the check as (awu_max % align) so that:

1. intention is immediately clear
2. It'll directly work for non power of 2 boundary in future without
change.

Just my 2 cents.
> 
> > So maybe we should use the % expression explicitly so that the intention
> > is immediately clear.
> 
> As mentioned, I wanted to keep it simple. In addition, it's a bit of a mess
> for the FS block allocator to work with odd sizes, like 12. And it does not
> suit RAID stripe alignment, which works in powers-of-2.
> 
> > 
> > > +		*unit_min = 0;
> > > +		*unit_max = 0;
> > > +	} else {
> > > +		if (awu_min)
> > > +			*unit_min = min(awu_min, align);
> > 
> > How will the min() here work? If awu_min is the minumum set by the
> > device, how can statx be allowed to advertise something smaller than
> > that?
> The idea is that is that if the awu_min reported by the device is less than
> the FS block size, then we report awu_min = FS block size. We already know
> that awu_max => FS block size, since we got this far, so saying that awu_min
> = FS block size is ok.
> 
> Otherwise it is the minimum of alignment and awu_min. I suppose that does
> not make much sense, and we should just always require awu_min = FS block
> size.

Yep, that also works. We should just set the minimum to FS block size,
since that min() operator there is confusing.

Regards,
ojaswin