On Tue, Aug 06, 2024 at 03:24:32PM GMT, Jan Kara wrote:
> On Tue 06-08-24 19:54:58, Yafang Shao wrote:
> > On Mon, Aug 5, 2024 at 9:40 PM Jan Kara <jack@xxxxxxx> wrote:
> > > On Sun 04-08-24 16:02:51, Yafang Shao wrote:
> > > > Background
> > > > ==========
> > > >
> > > > Our big data workloads are deployed on XFS-based disks, and we frequently
> > > > encounter hung tasks caused by xfs_ilock. These hung tasks arise because
> > > > different applications may access the same files concurrently. For example,
> > > > while a datanode task is writing to a file, a filebeat[0] task might be
> > > > reading the same file concurrently. If the task writing to the file takes a
> > > > long time, the task reading the file will hang due to contention on the XFS
> > > > inode lock.
> > > >
> > > > This inode lock contention between writing and reading files only occurs on
> > > > XFS, but not on other file systems such as EXT4. Dave provided a clear
> > > > explanation for why this occurs only on XFS[1]:
> > > >
> > > > : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> > > > : Atomic means that all the bytes from a single operation that started
> > > > : out together end up together, without interleaving from other I/O
> > > > : operations. [2]
> > > > : XFS is the only linux filesystem that provides this behaviour.
> > > >
> > > > As we have been running big data on XFS for years, we don't want to switch
> > > > to other file systems like EXT4. Therefore, we plan to resolve these issues
> > > > within XFS.
> > > >
> > > > Proposal
> > > > ========
> > > >
> > > > One solution we're currently exploring is leveraging the preadv2(2)
> > > > syscall. By using the RWF_NOWAIT flag, preadv2(2) can avoid the XFS inode
> > > > lock hung task. This can be illustrated as follows:
> > > >
> > > > retry:
> > > >         if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
> > > >                 sleep(n);
> > > >                 goto retry;
> > > >         }
> > > >
> > > > Since the tasks reading the same files are not critical tasks, a delay in
> > > > reading is acceptable. However, RWF_NOWAIT not only enables IOCB_NOWAIT but
> > > > also enables IOCB_NOIO. Therefore, if the file is not in the page cache, it
> > > > will loop indefinitely until someone else reads it from disk, which is not
> > > > acceptable.
> > > >
> > > > So we're planning to introduce a new flag, RWF_IOWAIT, for preadv2(2). This
> > > > flag will allow reading from the disk if the file is not in the page cache,
> > > > but will not allow waiting for the lock if it is held by others. With this
> > > > new flag, we can resolve our issues effectively.
> > > >
> > > > Link: https://github.com/elastic/beats/tree/master/filebeat [0]
> > > > Link: https://lore.kernel.org/linux-xfs/20190325001044.GA23020@dastard/ [1]
> > > > Link: https://pubs.opengroup.org/onlinepubs/009695399/functions/read.html [2]
> > > > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> > > > Cc: Dave Chinner <david@xxxxxxxxxxxxx>
> > >
> > > Thanks for the detailed explanation! I understand your problem, but I have
> > > to say this flag looks like a hack to work around particular XFS behavior,
> > > and the guarantees the new RWF_IOWAIT flag should provide are not very
> > > clear to me.
> >
> > Its guarantee is clear:
> >
> > : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> > : Atomic means that all the bytes from a single operation that started
> > : out together end up together, without interleaving from other I/O
> > : operations.
>
> Oh, I understand why XFS does locking this way and I'm well aware this is
> a requirement in POSIX. However, as you have experienced, it has a
> significant performance cost for certain workloads (at least with the
> simple locking protocol we have now), and history shows users would rather
> have the extra performance at the cost of being a bit more careful in
> userspace. So I don't see any filesystem switching to XFS behavior until
> we have a performant range locking primitive.
>
> > What this flag does is avoid waiting for this type of lock if it
> > exists. Maybe we should consider a more descriptive name like
> > RWF_NOATOMICWAIT, RWF_NOFSLOCK, or RWF_NOPOSIXWAIT? Naming is always
> > challenging.
>
> Aha, OK. So you want the flag to mean "I don't care about POSIX read-write
> exclusion". I'm still not convinced the flag is a great idea, but
> RWF_NOWRITEEXCLUSION could perhaps better describe the intent of the flag.

I have to say that I find this extremely hard to swallow because it is so
clearly specific to an individual filesystem. If we're doing this hack, I
would like an Ack from at least both Jan and Dave.
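
For reference, the userspace side being proposed would look roughly like the
sketch below. This is illustration only: it assumes the flag keeps the
RWF_IOWAIT name from the patch (the value used here is a placeholder, not an
allocated flag bit) and that, like RWF_NOWAIT, lock contention is reported as
EAGAIN.

    /*
     * Sketch only. RWF_IOWAIT is the flag proposed in this thread; it does
     * not exist in any released kernel, and the value below is a placeholder
     * so the example compiles, not an allocated flag bit.
     */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <sys/types.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #ifndef RWF_IOWAIT
    #define RWF_IOWAIT 0x00000080   /* placeholder value, illustration only */
    #endif

    /*
     * Read without waiting on the inode lock. Unlike RWF_NOWAIT, the proposed
     * flag would still allow the read to go to disk on a page cache miss, so
     * the loop cannot spin forever on cold data.
     */
    static ssize_t read_no_lockwait(int fd, const struct iovec *iov,
                                    int iovcnt, off_t offset)
    {
            ssize_t ret;

            for (;;) {
                    ret = preadv2(fd, iov, iovcnt, offset, RWF_IOWAIT);
                    if (ret >= 0 || errno != EAGAIN)
                            return ret;     /* data, EOF, or a real error */
                    /* Lock is contended; this reader is not latency critical. */
                    sleep(1);
            }
    }

Whether EAGAIN should mean "the lock is held" as opposed to "the I/O would
block" is exactly the kind of guarantee that would need to be spelled out in
the man page before something like this could be merged.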