Re: [PATCH] fs: Add a new flag RWF_IOWAIT for preadv2(2)

On Tue 06-08-24 19:54:58, Yafang Shao wrote:
> On Mon, Aug 5, 2024 at 9:40 PM Jan Kara <jack@xxxxxxx> wrote:
> > On Sun 04-08-24 16:02:51, Yafang Shao wrote:
> > > Background
> > > ==========
> > >
> > > Our big data workloads are deployed on XFS-based disks, and we frequently
> > > encounter hung tasks caused by xfs_ilock. These hung tasks arise because
> > > different applications may access the same files concurrently. For example,
> > > while a datanode task is writing to a file, a filebeat[0] task might be
> > > reading the same file concurrently. If the task writing to the file takes a
> > > long time, the task reading the file will hang due to contention on the XFS
> > > inode lock.
> > >
> > > This inode lock contention between writing and reading files only occurs on
> > > XFS, but not on other file systems such as EXT4. Dave provided a clear
> > > explanation for why this occurs only on XFS[1]:
> > >
> > >   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> > >   : Atomic means that all the bytes from a single operation that started
> > >   : out together end up together, without interleaving from other I/O
> > >   : operations. [2]
> > >   : XFS is the only linux filesystem that provides this behaviour.
> > >
> > > As we have been running big data on XFS for years, we don't want to switch
> > > to other file systems like EXT4. Therefore, we plan to resolve these issues
> > > within XFS.
> > >
> > > Proposal
> > > ========
> > >
> > > One solution we're currently exploring is leveraging the preadv2(2)
> > > syscall. By using the RWF_NOWAIT flag, preadv2(2) can avoid blocking on
> > > the XFS inode lock and thus the hung task. This can be illustrated as
> > > follows:
> > >
> > >   retry:
> > >       if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
> > >           sleep(n);
> > >           goto retry;
> > >       }
> > >
> > > Since the tasks reading the same files are not critical tasks, a delay in
> > > reading is acceptable. However, RWF_NOWAIT not only enables IOCB_NOWAIT but
> > > also enables IOCB_NOIO. Therefore, if the file is not in the page cache,
> > > the retry loop above will spin indefinitely until someone else reads the
> > > file from disk, which is not acceptable.
> > >
> > > So we're planning to introduce a new flag, RWF_IOWAIT, to preadv2(2). This
> > > flag will allow reading from the disk if the file is not in the page cache
> > > but will not allow waiting for the lock if it is held by others. With this
> > > new flag, we can resolve our issues effectively.
> > >
> > > Link: https://github.com/elastic/beats/tree/master/filebeat [0]
> > > Link: https://lore.kernel.org/linux-xfs/20190325001044.GA23020@dastard/ [1]
> > > Link: https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html [2]
> > > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> > > Cc: Dave Chinner <david@xxxxxxxxxxxxx>
> >
> > Thanks for the detailed explanation! I understand your problem, but I have
> > to say this flag looks to me like a hack to work around particular XFS
> > behavior, and the guarantees the new RWF_IOWAIT flag should provide are not
> > very clear to me.
> 
> Its guarantee is clear:
> 
>   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
>   : Atomic means that all the bytes from a single operation that started
>   : out together end up together, without interleaving from other I/O
>   : operations.

Oh, I understand why XFS does locking this way and I'm well aware this is
a requirement in POSIX. However, as you have experienced, it has a
significant performance cost for certain workloads (at least with the
simple locking protocol we have now), and history shows that users would
rather have the extra performance at the cost of being a bit more careful
in userspace. So I don't see any filesystem switching to XFS behavior until
we have a performant range locking primitive.

> What this flag does is avoid waiting for this type of lock if it
> exists. Maybe we should consider a more descriptive name like
> RWF_NOATOMICWAIT, RWF_NOFSLOCK, or RWF_NOPOSIXWAIT? Naming is always
> challenging.

Aha, OK. So you want the flag to mean "I don't care about POSIX read-write
exclusion". I'm still not convinced the flag is a great idea, but
RWF_NOWRITEEXCLUSION could perhaps better describe the intent of the flag.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
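
For readers following the thread, the userspace retry loop quoted above and
its page-cache limitation can be sketched in C with what exists today. This
is only an illustration, not code from the patch: the helper names
read_nowait_with_fallback() and demo_roundtrip() are hypothetical, and the
fallback to a blocking pread() after a bounded number of attempts is an
assumption made here so that the loop cannot spin forever on uncached data.
That fallback can of course block on the XFS inode lock again, which is the
gap the proposed flag is meant to close.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <time.h>
#include <unistd.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008 /* value from the Linux UAPI headers */
#endif

/*
 * Hypothetical helper: try a non-blocking read up to max_tries times,
 * sleeping delay_ms between attempts, then fall back to a blocking pread().
 */
static ssize_t read_nowait_with_fallback(int fd, void *buf, size_t len,
                                         off_t off, int max_tries,
                                         long delay_ms)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct timespec delay = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };

    assert(buf != NULL);

    for (int i = 0; i < max_tries; i++) {
        ssize_t n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);

        if (n >= 0)
            return n;   /* data (possibly a short read) or EOF */
        if (errno == EOPNOTSUPP || errno == ENOSYS)
            break;      /* RWF_NOWAIT unsupported here: go blocking */
        if (errno != EAGAIN && errno != EINTR)
            return -1;  /* genuine error, errno is preserved */
        nanosleep(&delay, NULL); /* would block or data not cached */
    }

    /* Assumed policy: give up on non-blocking reads and wait. */
    return pread(fd, buf, len, off);
}

/* Usage sketch: write a small temporary file and read it back. */
static int demo_roundtrip(void)
{
    char path[] = "/tmp/nowait-demo-XXXXXX";
    char buf[16] = { 0 };
    int fd = mkstemp(path);

    if (fd < 0)
        return 0;
    unlink(path);
    if (write(fd, "hello", 5) != 5) {
        close(fd);
        return 0;
    }

    ssize_t n = read_nowait_with_fallback(fd, buf, sizeof(buf), 0, 3, 10);

    close(fd);
    return n == 5 && memcmp(buf, "hello", 5) == 0;
}
```

Retrying only on EAGAIN (and EINTR) rather than on every negative return
keeps real errors such as EBADF from turning into an endless loop.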



