Re: [PATCH 3/6] fs: xfs: Support FS_XFLAG_ATOMICWRITES for rtvol

John Garry <john.g.garry@xxxxxxxxxx> · Mon, 5 Feb 2024 12:51:07 +0000

On 02/02/2024 17:52, Darrick J. Wong wrote:
On Wed, Jan 24, 2024 at 02:26:42PM +0000, John Garry wrote:
Add initial support for FS_XFLAG_ATOMICWRITES in rtvol.

Current kernel support for atomic writes is based on HW support (for atomic
writes). As such, it is required to ensure extent alignment with
atomic_write_unit_max so that an atomic write can result in a single
HW-compliant IO operation.

rtvol already guarantees extent alignment, so initially add support there.

Signed-off-by: John Garry <john.g.garry@xxxxxxxxxx>
---
  fs/xfs/libxfs/xfs_format.h |  8 ++++++--
  fs/xfs/libxfs/xfs_sb.c     |  2 ++
  fs/xfs/xfs_inode.c         | 22 ++++++++++++++++++++++
  fs/xfs/xfs_inode.h         |  7 +++++++
  fs/xfs/xfs_ioctl.c         | 19 +++++++++++++++++--
  fs/xfs/xfs_mount.h         |  2 ++
  fs/xfs/xfs_super.c         |  4 ++++
  7 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 382ab1e71c0b..79fb0d4adeda 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -353,11 +353,13 @@ xfs_sb_has_compat_feature(
  #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
  #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
  #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
+#define XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES (1 << 29)	/* aligned file data extents */

I thought FORCEALIGN was going to signal aligned file data extent
allocations being mandatory?

Right, I'll fix that comment


This flag (AFAICT) simply marks the inode as something that gets
FMODE_CAN_ATOMIC_WRITES, right?

Correct


  #define XFS_SB_FEAT_RO_COMPAT_ALL \
  		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
  		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
  		 XFS_SB_FEAT_RO_COMPAT_REFLINK| \
-		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
+		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
+		 XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
  #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
  static inline bool
  xfs_sb_has_ro_compat_feature(
@@ -1085,16 +1087,18 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
  #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
  #define XFS_DIFLAG2_BIGTIME_BIT	3	/* big timestamps */
  #define XFS_DIFLAG2_NREXT64_BIT 4	/* large extent counters */
+#define XFS_DIFLAG2_ATOMICWRITES_BIT 6

Needs a comment here ("files flagged for atomic writes"). 

ok

Also not sure
why you skipped bit 5, though I'm guessing it's because the forcealign
series is/was using it?

Right, I'll fix that


  #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
  #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
  #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
  #define XFS_DIFLAG2_BIGTIME	(1 << XFS_DIFLAG2_BIGTIME_BIT)
  #define XFS_DIFLAG2_NREXT64	(1 << XFS_DIFLAG2_NREXT64_BIT)
+#define XFS_DIFLAG2_ATOMICWRITES	(1 << XFS_DIFLAG2_ATOMICWRITES_BIT)
  
  #define XFS_DIFLAG2_ANY \
  	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
-	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64)
+	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_ATOMICWRITES)
  
  static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
  {
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 4a9e8588f4c9..28a98130a56d 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -163,6 +163,8 @@ xfs_sb_version_to_features(
  		features |= XFS_FEAT_REFLINK;
  	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
  		features |= XFS_FEAT_INOBTCNT;
+	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
+		features |= XFS_FEAT_ATOMICWRITES;
  	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
  		features |= XFS_FEAT_FTYPE;
  	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 1fd94958aa97..0b0f525fd043 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -65,6 +65,26 @@ xfs_get_extsz_hint(
  	return 0;
  }
  
+/*
+ * helper function to extract extent size

How does that differ from xfs_get_extsz_hint?

The idea of this function is to return the guaranteed extent alignment, 
and not just the hint


+ */
+xfs_extlen_t
+xfs_get_extsz(
+	struct xfs_inode	*ip)
+{
+	/*
+	 * No point in aligning allocations if we need to COW to actually
+	 * write to them.

What does alwayscow have to do with untorn writes?

Nothing at the moment, so I'll remove this.


+	 */
+	if (xfs_is_always_cow_inode(ip))
+		return 0;
+
+	if (XFS_IS_REALTIME_INODE(ip))
+		return ip->i_mount->m_sb.sb_rextsize;
+
+	return 1;
+}

Does this function exist to return the allocation unit for a given file?
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=b8ddcef3df8da02ed2c4aacbed1d811e60372006


Yes, something like xfs_inode_alloc_unitsize() there.

What's the upstream status for that change? I see it mentioned in 
linux-xfs lore and seems to be part of a mega patchset.

+
  /*
   * Helper function to extract CoW extent size hint from inode.
   * Between the extent size hint and the CoW extent size hint, we
@@ -629,6 +649,8 @@ xfs_ip2xflags(
  			flags |= FS_XFLAG_DAX;
  		if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
  			flags |= FS_XFLAG_COWEXTSIZE;
+		if (ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
+			flags |= FS_XFLAG_ATOMICWRITES;
  	}
  
  	if (xfs_inode_has_attr_fork(ip))
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 97f63bacd4c2..0e0a21d9d30f 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -305,6 +305,11 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
  	return ip->i_diflags2 & XFS_DIFLAG2_NREXT64;
  }
  
+static inline bool xfs_inode_atomicwrites(struct xfs_inode *ip)

I think this predicate wants a verb in its name, the rest of them have
"is" or "has" somewhere:

"xfs_inode_has_atomicwrites"

ok, fine.

Note that I was copying xfs_inode_forcealign() in terms of naming.


+{
+	return ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES;
+}
+
  /*
   * Return the buftarg used for data allocations on a given inode.
   */
@@ -542,7 +547,9 @@ void		xfs_lock_two_inodes(struct xfs_inode *ip0, uint ip0_mode,
  				struct xfs_inode *ip1, uint ip1_mode);
  
  xfs_extlen_t	xfs_get_extsz_hint(struct xfs_inode *ip);
+xfs_extlen_t	xfs_get_extsz(struct xfs_inode *ip);
  xfs_extlen_t	xfs_get_cowextsz_hint(struct xfs_inode *ip);
+xfs_extlen_t	xfs_get_atomicwrites_size(struct xfs_inode *ip);
  
  int xfs_init_new_inode(struct mnt_idmap *idmap, struct xfs_trans *tp,
  		struct xfs_inode *pip, xfs_ino_t ino, umode_t mode,
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index f02b6e558af5..c380a3055be7 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1110,6 +1110,8 @@ xfs_flags2diflags2(
  		di_flags2 |= XFS_DIFLAG2_DAX;
  	if (xflags & FS_XFLAG_COWEXTSIZE)
  		di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
+	if (xflags & FS_XFLAG_ATOMICWRITES)
+		di_flags2 |= XFS_DIFLAG2_ATOMICWRITES;
  
  	return di_flags2;
  }
@@ -1122,10 +1124,12 @@ xfs_ioctl_setattr_xflags(
  {
  	struct xfs_mount	*mp = ip->i_mount;
  	bool			rtflag = (fa->fsx_xflags & FS_XFLAG_REALTIME);
+	bool			atomic_writes = fa->fsx_xflags & FS_XFLAG_ATOMICWRITES;
  	uint64_t		i_flags2;
  
-	if (rtflag != XFS_IS_REALTIME_INODE(ip)) {
-		/* Can't change realtime flag if any extents are allocated. */

Please augment this comment ("Can't change realtime or atomicwrites
flags if any extents are allocated") instead of deleting it.

I wasn't supposed to delete that - will remedy.

 This is
validation code, the requirements should be spelled out in English.

+
+	if (rtflag != XFS_IS_REALTIME_INODE(ip) ||
+	    atomic_writes != xfs_inode_atomicwrites(ip)) {
  		if (ip->i_df.if_nextents || ip->i_delayed_blks)
  			return -EINVAL;
  	}
@@ -1146,6 +1150,17 @@ xfs_ioctl_setattr_xflags(
  	if (i_flags2 && !xfs_has_v3inodes(mp))
  		return -EINVAL;
  
+	if (atomic_writes) {
+		if (!xfs_has_atomicwrites(mp))
+			return -EINVAL;
+
+		if (!rtflag)
+			return -EINVAL;
+
+		if (!is_power_of_2(mp->m_sb.sb_rextsize))
+			return -EINVAL;

Shouldn't we check sb_rextsize w.r.t. the actual block device queue
limits here?  I keep seeing similar validation logic open-coded
throughout both atomic write patchsets:

	if (l < queue_atomic_write_unit_min_bytes())
		/* fail */
	if (l > queue_atomic_write_unit_max_bytes())
		/* fail */
	if (!is_power_of_2(l))
		/* fail */
	/* ok */

which really should be a common helper somewhere.

I think that it is a reasonable comment about duplication the atomic 
writes checks for the bdev and iomap write paths - I can try to improve 
that.

But the is_power_of_2(mp->m_sb.sb_rextsize) check is to ensure that the 
extent size is suitable for enabling atomic writes. I don't see a point 
in checking the bdev queue limits here.


		/*
		 * Don't set atomic write if the allocation unit doesn't
		 * align with the device requirements.
		 */
		if (!bdev_validate_atomic_write(<target blockdev>,
				XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize))
			return -EINVAL;

Too bad we have to figure out the target blockdev and file allocation
unit based on the ioctl in-params and can't use the xfs_inode helpers
here.

I am not sure what bdev_validate_atomic_write() would even do. If 
sb_rextsize exceeded the bdev atomic write unit max, then we just cap 
reported atomic write unit max in statx to that which the bdev reports 
and vice-versa.

And didn't we previously have a concern that it is possible to change 
the geometry of the device? If so, not much point in this check.

Thanks,
John