On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote: > 2. Do we need to revalidate mappings for directio writes? I think the > answer is no (for xfs) because the ->iomap_begin call will allocate > whatever blocks are needed and truncate/punch/reflink block on the > iolock while the directio writes are pending, so you'll never end up > with a stale mapping. But I don't know if that statement applies > generally... The issue is not truncate/punch/reflink for either DIO or buffered IO - the issue that leads to stale iomaps is async extent state. i.e. IO completion doing unwritten extent conversion. For DIO, AIO doesn't hold the IOLOCK at all when completion is run (like buffered writeback), but non-AIO DIO writes hold the IOLOCK shared while waiting for completion. This means that we can have DIO submission and completion still running concurrently, and so stale iomaps are a definite possibility. >From my notes when I looked at this: 1. the race condition for a DIO write mapping go stale is an overlapping DIO completion and converting the block from unwritten to written, and then the dio write incorrectly issuing sub-block zeroing because the mapping is now stale. 2. DIO read into a hole or unwritten extent zeroes the entire range in the user buffer in one operation. If this is a large range, this could race with small DIO writes within that range that have completed 3. There is a window between dio write completion doing unwritten extent conversion (by ->end_io) and the page cache being invalidated, providing a window where buffered read maps can be stale and incorrect read behaviour exposed to userpace before the page cache is invalidated. These all stem from IO having overlapping ranges, which is largely unsupported but can't be entirely prevented (e.g. backup applications running in the background). Largely the problems are confined to sub-block IOs. i.e. when sub-block DIO writes to the same block are being performed, we have the possiblity that one write completes whilst the other is deciding what to zero, unaware that the range is now MAPPED rather than UNWRITTEN. We currently avoid issues with sub-block dio writes by using IOMAP_DIO_OVERWRITE_ONLY with shared locking. This ensures that the unaligned IO fits entirely within a MAPPED extent so no sub-block zeroing is required. If allocation or sub-block zeroing is required, then we force the filesystem to fall back to exclusive IO locking and wait for all concurrent DIO in flight to complete so that it can't race with any other DIO write that might cause the map to become stale while we are doing the zeroing. This does not avoid potential issues with DIO write vs buffered read, nor DIO write vs mmap IO. It's not totally clear to me whether we need ->iomap_valid checks in the buffered read paths to avoid the completion races with DIO writes, but there are windows there where cached iomaps could be considered stale.... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx