Re: [RFC PATCH] xfs: skip discard of unwritten extents

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 1 May 2018 08:00:20 +1000

On Thu, Apr 26, 2018 at 02:06:24PM -0400, Brian Foster wrote:
> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> ---
> 
> Hi all,
> 
> What do folks think of something like this?

Definitely sounds like something we need to address.

> The motivation here is that
> the VDO (dedup) devs had reported seeing online discards during
> write-only workloads. These turn out to be related to trimming post-eof
> preallocation blocks after large file copies. To my knowledge, this
> isn't really a prevalent or serious issue, but I think that technically
> these discards are unnecessary and so I was looking into how we could
> avoid them.

We simply trucate post-eof extents, right? So we know in
xfs_itruncate_extents() if the inode size is changing, not to
mention we know if the extent is beyond EOF? e.g. all calls to
xfs_itruncate_extents() other than xfs_free_eofblocks() change the
inode size and so directly indicate they are removing written
blocks. Anything where the inode size is not changing is doing a
post-eof removal, and so we can assume no data has been written?

So rather than converting everything to unwritten extents, the "skip
discard flag" is simply triggered via extents being freed sitting
beyond the current EOF (not the new EOF) and/or being unwritten?

> This behavior is of course not directly related to unwritten extents,
> but the immediate/obvious solution to bubble up a bmapi flag of some
> kind to xfs_free_eofblocks() seemed rather crude. From there, I figured
> that we technically don't need to discard any unwritten extents (within
> or beyond EOF) because they haven't been written to since being
> allocated. In fact, I'm not sure we have to even busy them, but it's
> roughly equivalent logic either way and I'm trying to avoid getting too
> clever.

I think we still need to busy them to avoid re-allocating them in
the same checkpoint, as data extent free/realloc in the same
checkpoint could result in a failed recovery (i.e. partial
checkpoint replay) leaving the extent linked into two separate
files.

> I also recall that we've discussed using unwritten extents for delalloc
> -> real conversion to avoid the small stale data exposure window that
> exists in writeback. Without getting too deep into the reason we don't
> currently do an initial unwritten allocation [1], I don't think there's
> anything blocking us from converting any post-eof blocks that happen to
> be part of the resulting normal allocation. As it is, the imap is
> already trimmed to EOF by the writeback code for coherency reasons. If
> we were to convert post-eof blocks (not part of this patch) along with
> something like this patch, then we'd indirectly prevent discards for
> eofblocks trims.

I think we should leave that as a separate problem, as writeback
currently has issues with the way we manage bufferhead state.
i.e. things don't work right if we put unwritten extents under
delalloc buffers and vice versa. [ I have patches to address that
I'm working on.] And there's also the issue that we need to change
the delalloc reservations to take into account block allocations
required by unwritten extent conversion needed by delalloc.

Hence I think we should address that as a separate problem, not as
the solution to avoiding discard of post-eof extents.

> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 4bcc095fe44a..942c90ec6747 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2954,13 +2954,15 @@ xfs_free_extent(
>  	xfs_fsblock_t		bno,	/* starting block number of extent */
>  	xfs_extlen_t		len,	/* length of extent */
>  	struct xfs_owner_info	*oinfo,	/* extent owner */
> -	enum xfs_ag_resv_type	type)	/* block reservation type */
> +	enum xfs_ag_resv_type	type,	/* block reservation type */
> +	bool			skip_discard)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
>  	struct xfs_buf		*agbp;
>  	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, bno);
>  	xfs_agblock_t		agbno = XFS_FSB_TO_AGBNO(mp, bno);
>  	int			error;
> +	unsigned int		busy_flags = 0;
>  
>  	ASSERT(len != 0);
>  	ASSERT(type != XFS_AG_RESV_AGFL);
> @@ -2984,7 +2986,9 @@ xfs_free_extent(
>  	if (error)
>  		goto err;
>  
> -	xfs_extent_busy_insert(tp, agno, agbno, len, 0);
> +	if (skip_discard)
> +		busy_flags |= XFS_EXTENT_BUSY_SKIP_DISCARD;
> +	xfs_extent_busy_insert(tp, agno, agbno, len, busy_flags);
>  	return 0;

Rather than changing xfs_free_extent(), how about adding a
xfs_free_extent_nodiscard() wrapper, and only call it when
processing an extent that doesn't need discard? This means none of
the other code that frees extents needs to be changed. Similarly
add a xfs_bmap_add_free_nodiscard() wrapper. That will cut down on
the code churn caused by passing new parameters everywhere....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html