On Tue 06-08-24 19:54:58, Yafang Shao wrote: > On Mon, Aug 5, 2024 at 9:40 PM Jan Kara <jack@xxxxxxx> wrote: > > On Sun 04-08-24 16:02:51, Yafang Shao wrote: > > > Background > > > ========== > > > > > > Our big data workloads are deployed on XFS-based disks, and we frequently > > > encounter hung tasks caused by xfs_ilock. These hung tasks arise because > > > different applications may access the same files concurrently. For example, > > > while a datanode task is writing to a file, a filebeat[0] task might be > > > reading the same file concurrently. If the task writing to the file takes a > > > long time, the task reading the file will hang due to contention on the XFS > > > inode lock. > > > > > > This inode lock contention between writing and reading files only occurs on > > > XFS, but not on other file systems such as EXT4. Dave provided a clear > > > explanation for why this occurs only on XFS[1]: > > > > > > : I/O is intended to be atomic to ordinary files and pipes and FIFOs. > > > : Atomic means that all the bytes from a single operation that started > > > : out together end up together, without interleaving from other I/O > > > : operations. [2] > > > : XFS is the only linux filesystem that provides this behaviour. > > > > > > As we have been running big data on XFS for years, we don't want to switch > > > to other file systems like EXT4. Therefore, we plan to resolve these issues > > > within XFS. > > > > > > Proposal > > > ======== > > > > > > One solution we're currently exploring is leveraging the preadv2(2) > > > syscall. By using the RWF_NOWAIT flag, preadv2(2) can avoid the XFS inode > > > lock hung task. This can be illustrated as follows: > > > > > > retry: > > > if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) { > > > sleep(n) > > > goto retry; > > > } > > > > > > Since the tasks reading the same files are not critical tasks, a delay in > > > reading is acceptable. However, RWF_NOWAIT not only enables IOCB_NOWAIT but > > > also enables IOCB_NOIO. Therefore, if the file is not in the page cache, it > > > will loop indefinitely until someone else reads it from disk, which is not > > > acceptable. > > > > > > So we're planning to introduce a new flag, IOCB_IOWAIT, to preadv2(2). This > > > flag will allow reading from the disk if the file is not in the page cache > > > but will not allow waiting for the lock if it is held by others. With this > > > new flag, we can resolve our issues effectively. > > > > > > Link: https://lore.kernel.org/linux-xfs/20190325001044.GA23020@dastard/ [0] > > > Link: https://github.com/elastic/beats/tree/master/filebeat [1] > > > Link: https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html [2] > > > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx> > > > Cc: Dave Chinner <david@xxxxxxxxxxxxx> > > > > Thanks for the detailed explanation! I understand your problem but I have to > > say I find this flag like a hack to workaround particular XFS behavior and > > the guarantees the new RWF_IOWAIT flag should provide are not very clear to > > me. > > Its guarantee is clear: > > : I/O is intended to be atomic to ordinary files and pipes and FIFOs. > : Atomic means that all the bytes from a single operation that started > : out together end up together, without interleaving from other I/O > : operations. Oh, I understand why XFS does locking this way and I'm well aware this is a requirement in POSIX. However, as you have experienced, it has a significant performance cost for certain workloads (at least with simple locking protocol we have now) and history shows users rather want the extra performance at the cost of being a bit more careful in userspace. So I don't see any filesystem switching to XFS behavior until we have a performant range locking primitive. > What this flag does is avoid waiting for this type of lock if it > exists. Maybe we should consider a more descriptive name like > RWF_NOATOMICWAIT, RWF_NOFSLOCK, or RWF_NOPOSIXWAIT? Naming is always > challenging. Aha, OK. So you want the flag to mean "I don't care about POSIX read-write exclusion". I'm still not convinced the flag is a great idea but RWF_NOWRITEEXCLUSION could perhaps better describe the intent of the flag. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR