Re: [PATCH] fs: Add a new flag RWF_IOWAIT for preadv2(2)

Yafang Shao <laoar.shao@xxxxxxxxx> · Tue, 6 Aug 2024 19:54:58 +0800

On Mon, Aug 5, 2024 at 9:40 PM Jan Kara <jack@xxxxxxx> wrote:
>
> On Sun 04-08-24 16:02:51, Yafang Shao wrote:
> > Background
> > ==========
> >
> > Our big data workloads are deployed on XFS-based disks, and we frequently
> > encounter hung tasks caused by xfs_ilock. These hung tasks arise because
> > different applications may access the same files concurrently. For example,
> > while a datanode task is writing to a file, a filebeat[0] task might be
> > reading the same file concurrently. If the task writing to the file takes a
> > long time, the task reading the file will hang due to contention on the XFS
> > inode lock.
> >
> > This inode lock contention between writing and reading files only occurs on
> > XFS, but not on other file systems such as EXT4. Dave provided a clear
> > explanation for why this occurs only on XFS[1]:
> >
> >   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> >   : Atomic means that all the bytes from a single operation that started
> >   : out together end up together, without interleaving from other I/O
> >   : operations. [2]
> >   : XFS is the only linux filesystem that provides this behaviour.
> >
> > As we have been running big data on XFS for years, we don't want to switch
> > to other file systems like EXT4. Therefore, we plan to resolve these issues
> > within XFS.
> >
> > Proposal
> > ========
> >
> > One solution we're currently exploring is leveraging the preadv2(2)
> > syscall. By using the RWF_NOWAIT flag, preadv2(2) can avoid the XFS inode
> > lock hung task. This can be illustrated as follows:
> >
> >   retry:
> >       if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
> >           sleep(n)
> >           goto retry;
> >       }
> >
> > Since the tasks reading the same files are not critical tasks, a delay in
> > reading is acceptable. However, RWF_NOWAIT not only enables IOCB_NOWAIT but
> > also enables IOCB_NOIO. Therefore, if the file is not in the page cache, it
> > will loop indefinitely until someone else reads it from disk, which is not
> > acceptable.
> >
> > So we're planning to introduce a new flag, IOCB_IOWAIT, to preadv2(2). This
> > flag will allow reading from the disk if the file is not in the page cache
> > but will not allow waiting for the lock if it is held by others. With this
> > new flag, we can resolve our issues effectively.
> >
> > Link: https://lore.kernel.org/linux-xfs/20190325001044.GA23020@dastard/ [0]
> > Link: https://github.com/elastic/beats/tree/master/filebeat [1]
> > Link: https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html [2]
> > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> > Cc: Dave Chinner <david@xxxxxxxxxxxxx>
>
> Thanks for the detailed explanation! I understand your problem but I have to
> say I find this flag like a hack to workaround particular XFS behavior and
> the guarantees the new RWF_IOWAIT flag should provide are not very clear to
> me.

Its guarantee is clear:

  : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
  : Atomic means that all the bytes from a single operation that started
  : out together end up together, without interleaving from other I/O
  : operations.

What this flag does is avoid waiting for this type of lock if it
exists. Maybe we should consider a more descriptive name like
RWF_NOATOMICWAIT, RWF_NOFSLOCK, or RWF_NOPOSIXWAIT? Naming is always
challenging.

Since this behavior is required by POSIX, it shouldn't be viewed as an
XFS-specific behavior. Other filesystems might adopt this rule in the
future as well.

> I've CCed Amir who's been dealing with similar issues with XFS at his
> employer and had some patches as far as I remember.
>
> What you could possibly do to read the file contents without blocking on
> xfs_iolock is to mmap the file and grab the data from the mapping. It is
> still hacky but at least we don't have to pollute the kernel with an IO
> flag with unclear semantics.

The file size to be read is not fixed, which is why we prefer to use
the traditional read API rather than mmap. We have implemented a
hotfix version of this commit on many of our production servers, and
it works well as expected. While I agree that mmap() is another viable
option, we may consider switching to it in the future if this new flag
introduces any issues.

-- 
Regards
Yafang