Re: [PATCH] fs: Add a new flag RWF_IOWAIT for preadv2(2)

On Tue, Aug 6, 2024 at 9:24 PM Jan Kara <jack@xxxxxxx> wrote:
>
> On Tue 06-08-24 19:54:58, Yafang Shao wrote:
> > On Mon, Aug 5, 2024 at 9:40 PM Jan Kara <jack@xxxxxxx> wrote:
> > > On Sun 04-08-24 16:02:51, Yafang Shao wrote:
> > > > Background
> > > > ==========
> > > >
> > > > Our big data workloads are deployed on XFS-based disks, and we frequently
> > > > encounter hung tasks caused by xfs_ilock. These hung tasks arise because
> > > > different applications may access the same files concurrently. For example,
> > > > while a datanode task is writing to a file, a filebeat[0] task might be
> > > > reading the same file concurrently. If the task writing to the file takes a
> > > > long time, the task reading the file will hang due to contention on the XFS
> > > > inode lock.
> > > >
> > > > This inode lock contention between writing and reading files only occurs on
> > > > XFS, but not on other file systems such as EXT4. Dave provided a clear
> > > > explanation for why this occurs only on XFS[1]:
> > > >
> > > >   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> > > >   : Atomic means that all the bytes from a single operation that started
> > > >   : out together end up together, without interleaving from other I/O
> > > >   : operations. [2]
> > > >   : XFS is the only linux filesystem that provides this behaviour.
> > > >
> > > > As we have been running big data on XFS for years, we don't want to switch
> > > > to other file systems like EXT4. Therefore, we plan to resolve these issues
> > > > within XFS.
> > > >
> > > > Proposal
> > > > ========
> > > >
> > > > One solution we're currently exploring is leveraging the preadv2(2)
> > > > syscall. By using the RWF_NOWAIT flag, preadv2(2) can avoid the XFS inode
> > > > lock hung task. This can be illustrated as follows:
> > > >
> > > >   retry:
> > > >       if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
> > > >           sleep(n);
> > > >           goto retry;
> > > >       }
> > > >
> > > > Since the tasks reading the same files are not critical tasks, a delay in
> > > > reading is acceptable. However, RWF_NOWAIT not only enables IOCB_NOWAIT but
> > > > also enables IOCB_NOIO. Therefore, if the file is not in the page cache, it
> > > > will loop indefinitely until someone else reads it from disk, which is not
> > > > acceptable.
> > > >
> > > > So we're planning to introduce a new flag, RWF_IOWAIT, to preadv2(2). This
> > > > flag will allow reading from the disk if the file is not in the page cache
> > > > but will not allow waiting for the lock if it is held by others. With this
> > > > new flag, we can resolve our issues effectively.
> > > >
> > > > Link: https://github.com/elastic/beats/tree/master/filebeat [0]
> > > > Link: https://lore.kernel.org/linux-xfs/20190325001044.GA23020@dastard/ [1]
> > > > Link: https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html [2]
> > > > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> > > > Cc: Dave Chinner <david@xxxxxxxxxxxxx>
> > >
> > > Thanks for the detailed explanation! I understand your problem, but I have
> > > to say this flag looks like a hack to work around particular XFS behavior,
> > > and the guarantees the new RWF_IOWAIT flag should provide are not very
> > > clear to me.
> >
> > Its guarantee is clear:
> >
> >   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> >   : Atomic means that all the bytes from a single operation that started
> >   : out together end up together, without interleaving from other I/O
> >   : operations.
>
> Oh, I understand why XFS does locking this way and I'm well aware this is
> a requirement in POSIX. However, as you have experienced, it has a
> significant performance cost for certain workloads (at least with the simple
> locking protocol we have now) and history shows users would rather have the
> extra performance at the cost of being a bit more careful in userspace. So I
> don't see any filesystem switching to XFS behavior until we have a
> performant range locking primitive.
>
> > What this flag does is avoid waiting for this type of lock if it
> > exists. Maybe we should consider a more descriptive name like
> > RWF_NOATOMICWAIT, RWF_NOFSLOCK, or RWF_NOPOSIXWAIT? Naming is always
> > challenging.
>
> Aha, OK. So you want the flag to mean "I don't care about POSIX read-write
> exclusion". I'm still not convinced the flag is a great idea but
> RWF_NOWRITEEXCLUSION could perhaps better describe the intent of the flag.

That's better. Should we proceed with implementing this new flag? It
provides users with an option to avoid this type of issue.
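
For reference, the RWF_NOWAIT retry loop from the patch description can be sketched in userspace C roughly as below. This is only an illustration of the current behavior, not part of the patch: the helper name read_nowait_bounded(), the retry limit, the delay, and the final blocking pread(2) fallback are all assumptions made for the example. The fallback is there only to show why the plain loop is a problem: with RWF_NOWAIT an uncached range keeps returning EAGAIN, so a caller either spins forever or eventually has to accept a blocking read.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for preadv2() and RWF_NOWAIT in glibc */
#endif
#include <errno.h>
#include <fcntl.h>              /* open(2) flags for callers of the helper */
#include <sys/types.h>
#include <sys/uio.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (name and policy are assumptions, not from the patch).
 * Try a non-blocking read up to max_tries times, sleeping delay_ms between
 * attempts, then fall back to an ordinary blocking pread().
 * Returns the number of bytes read, 0 at EOF, or -1 with errno set.
 */
static ssize_t read_nowait_bounded(int fd, void *buf, size_t len, off_t off,
                                   int max_tries, int delay_ms)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct timespec ts = {
        .tv_sec = delay_ms / 1000,
        .tv_nsec = (delay_ms % 1000) * 1000000L,
    };

    for (int i = 0; i < max_tries; i++) {
        ssize_t n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);

        if (n >= 0)
            return n;           /* data (possibly a short read) or EOF */
        if (errno == EOPNOTSUPP || errno == ENOSYS)
            break;              /* RWF_NOWAIT or preadv2 unavailable here */
        if (errno != EAGAIN && errno != EINTR)
            return -1;          /* real error, errno already set */

        /* EAGAIN: the lock is held or the data is not in the page cache. */
        nanosleep(&ts, NULL);
    }

    /* Assumed fallback: a blocking read, which may wait on the inode lock. */
    return pread(fd, buf, len, off);
}
```

A caller would open the file as usual and invoke, for example, read_nowait_bounded(fd, buf, sizeof(buf), 0, 10, 100). Note that the helper cannot tell "lock contended" apart from "page not cached", since both surface as EAGAIN; that ambiguity is what the proposed flag is meant to remove.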

-- 
Regards
Yafang




