On Mon, Aug 5, 2024 at 9:40 PM Jan Kara <jack@xxxxxxx> wrote: > > On Sun 04-08-24 16:02:51, Yafang Shao wrote: > > Background > > ========== > > > > Our big data workloads are deployed on XFS-based disks, and we frequently > > encounter hung tasks caused by xfs_ilock. These hung tasks arise because > > different applications may access the same files concurrently. For example, > > while a datanode task is writing to a file, a filebeat[0] task might be > > reading the same file concurrently. If the task writing to the file takes a > > long time, the task reading the file will hang due to contention on the XFS > > inode lock. > > > > This inode lock contention between writing and reading files only occurs on > > XFS, but not on other file systems such as EXT4. Dave provided a clear > > explanation for why this occurs only on XFS[1]: > > > > : I/O is intended to be atomic to ordinary files and pipes and FIFOs. > > : Atomic means that all the bytes from a single operation that started > > : out together end up together, without interleaving from other I/O > > : operations. [2] > > : XFS is the only linux filesystem that provides this behaviour. > > > > As we have been running big data on XFS for years, we don't want to switch > > to other file systems like EXT4. Therefore, we plan to resolve these issues > > within XFS. > > > > Proposal > > ======== > > > > One solution we're currently exploring is leveraging the preadv2(2) > > syscall. By using the RWF_NOWAIT flag, preadv2(2) can avoid the XFS inode > > lock hung task. This can be illustrated as follows: > > > > retry: > > if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) { > > sleep(n) > > goto retry; > > } > > > > Since the tasks reading the same files are not critical tasks, a delay in > > reading is acceptable. However, RWF_NOWAIT not only enables IOCB_NOWAIT but > > also enables IOCB_NOIO. Therefore, if the file is not in the page cache, it > > will loop indefinitely until someone else reads it from disk, which is not > > acceptable. > > > > So we're planning to introduce a new flag, IOCB_IOWAIT, to preadv2(2). This > > flag will allow reading from the disk if the file is not in the page cache > > but will not allow waiting for the lock if it is held by others. With this > > new flag, we can resolve our issues effectively. > > > > Link: https://lore.kernel.org/linux-xfs/20190325001044.GA23020@dastard/ [0] > > Link: https://github.com/elastic/beats/tree/master/filebeat [1] > > Link: https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html [2] > > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx> > > Cc: Dave Chinner <david@xxxxxxxxxxxxx> > > Thanks for the detailed explanation! I understand your problem but I have to > say I find this flag like a hack to workaround particular XFS behavior and > the guarantees the new RWF_IOWAIT flag should provide are not very clear to > me. Its guarantee is clear: : I/O is intended to be atomic to ordinary files and pipes and FIFOs. : Atomic means that all the bytes from a single operation that started : out together end up together, without interleaving from other I/O : operations. What this flag does is avoid waiting for this type of lock if it exists. Maybe we should consider a more descriptive name like RWF_NOATOMICWAIT, RWF_NOFSLOCK, or RWF_NOPOSIXWAIT? Naming is always challenging. Since this behavior is required by POSIX, it shouldn't be viewed as an XFS-specific behavior. Other filesystems might adopt this rule in the future as well. > I've CCed Amir who's been dealing with similar issues with XFS at his > employer and had some patches as far as I remember. > > What you could possibly do to read the file contents without blocking on > xfs_iolock is to mmap the file and grab the data from the mapping. It is > still hacky but at least we don't have to pollute the kernel with an IO > flag with unclear semantics. The file size to be read is not fixed, which is why we prefer to use the traditional read API rather than mmap. We have implemented a hotfix version of this commit on many of our production servers, and it works well as expected. While I agree that mmap() is another viable option, we may consider switching to it in the future if this new flag introduces any issues. -- Regards Yafang