On Sun, Aug 04, 2024 at 04:02:51PM +0800, Yafang Shao wrote:
> Background
> ==========
>
> Our big data workloads are deployed on XFS-based disks, and we frequently
> encounter hung tasks caused by xfs_ilock. These hung tasks arise because
> different applications may access the same files concurrently. For example,
> while a datanode task is writing to a file, a filebeat[0] task might be
> reading the same file concurrently. If the task writing to the file takes a
> long time, the task reading the file will hang due to contention on the XFS
> inode lock.
>
> This inode lock contention between writing and reading files only occurs on
> XFS, but not on other file systems such as EXT4. Dave provided a clear
> explanation for why this occurs only on XFS[1]:
>
> : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> : Atomic means that all the bytes from a single operation that started
> : out together end up together, without interleaving from other I/O
> : operations. [2]
> : XFS is the only linux filesystem that provides this behaviour.
>
> As we have been running big data on XFS for years, we don't want to switch
> to other file systems like EXT4. Therefore, we plan to resolve these issues
> within XFS.

I've been looking at range locks again in the past few days because,
once again, the need has arisen for range locking that allows exclusive
range-based operations to take place whilst concurrent IO is occurring.
We need to be able to clone, unshare, punch holes, exchange extents,
etc. without interrupting ongoing IO to the same file.

This is just another one of the cases where range locking would solve
the problems you are having without giving up the atomic write vs read
behaviour POSIX asks us to provide...

> Proposal
> ========
>
> One solution we're currently exploring is leveraging the preadv2(2)
> syscall. By using the RWF_NOWAIT flag, preadv2(2) can avoid the XFS inode
> lock hung task. This can be illustrated as follows:
>
> retry:
>         if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
>                 sleep(n);
>                 goto retry;
>         }

Hmmm.

> Since the tasks reading the same files are not critical tasks, a delay in
> reading is acceptable. However, RWF_NOWAIT not only enables IOCB_NOWAIT but
> also enables IOCB_NOIO. Therefore, if the file is not in the page cache, it
> will loop indefinitely until someone else reads it from disk, which is not
> acceptable.
>
> So we're planning to introduce a new flag, RWF_IOWAIT, for preadv2(2). This
> flag will allow reading from the disk if the file is not in the page cache
> but will not allow waiting for the lock if it is held by others. With this
> new flag, we can resolve our issues effectively.
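
For reference, a self-contained userspace sketch of that retry pattern might
look like the code below. RWF_IOWAIT is only the flag this patch proposes and
is not part of any released UAPI, so the sketch passes the existing RWF_NOWAIT
and marks where the proposed flag would go; the fixed one-second backoff is
purely illustrative.

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <errno.h>
#include <unistd.h>

/*
 * Illustrative only: with plain RWF_NOWAIT this loop can also spin on
 * uncached data, because RWF_NOWAIT implies IOCB_NOIO and nothing ever
 * brings the data into the page cache. The proposed RWF_IOWAIT would
 * remove that case and leave lock contention as the main EAGAIN source.
 */
static ssize_t read_nonblocking_retry(int fd, const struct iovec *iov,
                                      int iovcnt, off_t offset)
{
        for (;;) {
                ssize_t ret = preadv2(fd, iov, iovcnt, offset,
                                      RWF_NOWAIT /* RWF_IOWAIT in the proposal */);
                if (ret >= 0 || errno != EAGAIN)
                        return ret;     /* success, or a real error */
                sleep(1);               /* back off instead of blocking */
        }
}

The errno check matters here: without it, any real read error, not just the
EAGAIN from a busy lock or uncached data, would be retried forever.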
>
> Link: https://github.com/elastic/beats/tree/master/filebeat [0]
> Link: https://lore.kernel.org/linux-xfs/20190325001044.GA23020@dastard/ [1]
> Link: https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html [2]
> Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> Cc: Dave Chinner <david@xxxxxxxxxxxxx>
> ---
>  include/linux/fs.h      | 6 ++++++
>  include/uapi/linux/fs.h | 5 ++++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index fd34b5755c0b..5df7b5b0927a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3472,6 +3472,12 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
>  			return -EPERM;
>  		ki->ki_flags &= ~IOCB_APPEND;
>  	}
> +	if (flags & RWF_IOWAIT) {
> +		kiocb_flags |= IOCB_NOWAIT;
> +		/* IOCB_NOIO is not allowed for RWF_IOWAIT */
> +		if (kiocb_flags & IOCB_NOIO)
> +			return -EINVAL;
> +	}

I'm not sure that this will be considered an acceptable workaround for
what most Linux filesystem developers regard as an anachronistic
filesystem behaviour.

I don't really want people to work around this XFS behaviour, either -
what I'd like to see is more people putting effort into trying to
solve the range locking problem...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx