Re: [PATCH] fs: Add a new flag RWF_IOWAIT for preadv2(2)

On Tue, Aug 06, 2024 at 03:24:32PM GMT, Jan Kara wrote:
> On Tue 06-08-24 19:54:58, Yafang Shao wrote:
> > On Mon, Aug 5, 2024 at 9:40 PM Jan Kara <jack@xxxxxxx> wrote:
> > > On Sun 04-08-24 16:02:51, Yafang Shao wrote:
> > > > Background
> > > > ==========
> > > >
> > > > Our big data workloads are deployed on XFS-based disks, and we frequently
> > > > encounter hung tasks caused by xfs_ilock. These hung tasks arise because
> > > > different applications may access the same files concurrently. For example,
> > > > while a datanode task is writing to a file, a filebeat[0] task might be
> > > > reading the same file concurrently. If the task writing to the file takes a
> > > > long time, the task reading the file will hang due to contention on the XFS
> > > > inode lock.
> > > >
> > > > This inode lock contention between writing and reading files only occurs on
> > > > XFS, but not on other file systems such as EXT4. Dave provided a clear
> > > > explanation for why this occurs only on XFS[1]:
> > > >
> > > >   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> > > >   : Atomic means that all the bytes from a single operation that started
> > > >   : out together end up together, without interleaving from other I/O
> > > >   : operations. [2]
> > > >   : XFS is the only linux filesystem that provides this behaviour.
> > > >
> > > > As we have been running big data on XFS for years, we don't want to switch
> > > > to other file systems like EXT4. Therefore, we plan to resolve these issues
> > > > within XFS.
> > > >
> > > > Proposal
> > > > ========
> > > >
> > > > One solution we're currently exploring is leveraging the preadv2(2)
> > > > syscall. By using the RWF_NOWAIT flag, preadv2(2) can avoid the XFS inode
> > > > lock hung task. This can be illustrated as follows:
> > > >
> > > >   retry:
> > > >       if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
> > > >           sleep(n);
> > > >           goto retry;
> > > >       }
> > > >
> > > > Since the tasks reading the same files are not critical tasks, a delay in
> > > > reading is acceptable. However, RWF_NOWAIT not only enables IOCB_NOWAIT but
> > > > also enables IOCB_NOIO. Therefore, if the file is not in the page cache, it
> > > > will loop indefinitely until someone else reads it from disk, which is not
> > > > acceptable.
> > > >
> > > > So we're planning to introduce a new flag, RWF_IOWAIT, to preadv2(2). This
> > > > flag will allow reading from the disk if the file is not in the page cache
> > > > but will not allow waiting for the lock if it is held by others. With this
> > > > new flag, we can resolve our issues effectively.
> > > >
> > > > Link: https://github.com/elastic/beats/tree/master/filebeat [0]
> > > > Link: https://lore.kernel.org/linux-xfs/20190325001044.GA23020@dastard/ [1]
> > > > Link: https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html [2]
> > > > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> > > > Cc: Dave Chinner <david@xxxxxxxxxxxxx>
> > >
> > > Thanks for the detailed explanation! I understand your problem, but I have
> > > to say this flag looks like a hack to work around a particular XFS
> > > behavior, and the guarantees the new RWF_IOWAIT flag should provide are not
> > > very clear to me.
> > 
> > Its guarantee is clear:
> > 
> >   : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> >   : Atomic means that all the bytes from a single operation that started
> >   : out together end up together, without interleaving from other I/O
> >   : operations.
> 
> Oh, I understand why XFS does locking this way and I'm well aware this is
> a requirement in POSIX. However, as you have experienced, it has a
> significant performance cost for certain workloads (at least with the simple
> locking protocol we have now) and history shows users would rather have the
> extra performance at the cost of being a bit more careful in userspace. So I
> don't see any filesystem switching to XFS behavior until we have a
> performant range locking primitive.
> 
> > What this flag does is avoid waiting for this type of lock if it
> > exists. Maybe we should consider a more descriptive name like
> > RWF_NOATOMICWAIT, RWF_NOFSLOCK, or RWF_NOPOSIXWAIT? Naming is always
> > challenging.
> 
> Aha, OK. So you want the flag to mean "I don't care about POSIX read-write
> exclusion". I'm still not convinced the flag is a great idea but
> RWF_NOWRITEEXCLUSION could perhaps better describe the intent of the flag.

I have to say that I find this extremely hard to swallow because it is so
clearly specific to an individual filesystem. If we're doing this hack I
would like an Ack from at least both Jan and Dave.
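
For anyone following along, the RWF_NOWAIT retry loop quoted above can be
fleshed out into a small userspace sketch. It uses only the existing
preadv2(2) and RWF_NOWAIT. The helper name read_nowait_with_fallback() and
its max_tries/delay_sec parameters are hypothetical, and the final blocking
read is an assumption about what an application might tolerate, not
something proposed in this thread:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): try a non-blocking read first.
 * While preadv2() reports EAGAIN (lock contended or data not in the page
 * cache), sleep and retry up to max_tries times, then fall back to an
 * ordinary blocking read. A successful RWF_NOWAIT read may be short.
 */
static ssize_t read_nowait_with_fallback(int fd, void *buf, size_t len,
                                         off_t offset, int max_tries,
                                         unsigned int delay_sec)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    for (int i = 0; i < max_tries; i++) {
        ssize_t n = preadv2(fd, &iov, 1, offset, RWF_NOWAIT);

        if (n >= 0)
            return n;            /* data (possibly partial) was available */
        if (errno == EOPNOTSUPP || errno == ENOSYS)
            break;               /* RWF_NOWAIT unsupported here: go blocking */
        if (errno != EAGAIN)
            return -1;           /* real error, errno is left for the caller */
        sleep(delay_sec);        /* would have blocked: back off and retry */
    }

    /* Assumed fallback: a plain read, which may wait for I/O and locks. */
    return preadv2(fd, &iov, 1, offset, 0);
}
```

The bounded retry avoids spinning forever when the data is not in the page
cache, but the fallback read can still block on the XFS inode lock, which is
the gap the proposed flag is meant to close.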



