Hi folks,

These patches address the data corruption first described here:

https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@xxxxxxxxxxxxxxxxxxx/

This data corruption has been seen in high profile production systems so there is some urgency to fix it. The underlying flaw is essentially a zero-day iomap bug, so whatever fix we come up with needs to be backportable to all supported stable kernels (i.e. ~4.18 onwards).

A combination of concurrent write()s, writeback IO completion and memory reclaim exposes the fact that the cached iomap held across an iomap_begin/iomap_end iteration can become stale without the iomap iterator actor being aware that the underlying filesystem extent map has changed.

Hence actions based on the iomap state (e.g. is unwritten or newly allocated) may actually be incorrect, as writeback actions may have changed that state (unwritten to written, delalloc to unwritten or written, etc). This affects partial block/page operations, where we may need to read from disk or zero cached pages depending on the actual extent state. Memory reclaim plays its part here in that it removes pages containing partial state from the page cache, exposing future partial page/block operations to incorrect behaviour.

Really, we should have known that this would be a problem - we have exactly the same issue with cached iomaps for writeback, and the ->map_blocks callback that occurs for every filesystem block we need to write back is responsible for validating that the cached iomap is still valid. The data corruption on the write() side is the result of not validating that the iomap is still valid before we initialise new pages and prepare them for data to be copied into them....

I'm not really happy with the solution I have for triggering remapping of an iomap when the current one is considered stale. Doing the right thing requires both iomap_iter() to handle stale iomaps correctly (esp. the "map is invalid before the first actor operation" case), and it requires the filesystem iomap_begin/iomap_end operations to co-operate and be aware of stale iomaps.

There are a bunch of *nasty* issues around handling failed writes in XFS that this has exposed - a failed write() that races with an mmap() based write to the same delalloc page will result in the mmap writes being silently lost if we punch out the delalloc range we allocated but didn't write to. generic/344 and generic/346 expose this bug directly if we punch out delalloc regions allocated by now-stale mappings.

Then, because we can't safely punch out the delalloc region we allocated when we have a stale iomap, we have to ensure that when we remap it the IOMAP_F_NEW flag is preserved, so that the iomap code knows it is uninitialised space that is being written into and will zero sub-page/sub-block ranges correctly. As a result, ->iomap_begin() needs to know if the previous iomap was IOMAP_F_STALE, and if so, it needs to know if that previous iomap was IOMAP_F_NEW so it can propagate the flag to the remap.

So the fix is awful, messy, and I really, really don't like it. But I don't have any better ideas right now, and the changes as presented fix the reproducer for the original data corruption and pass fstests without any XFS regressions for block size <= page size configurations.

Thoughts?

-Dave.
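
To make the revalidation idea above concrete, here is a minimal sketch - the same kind of check the XFS writeback ->map_blocks callback already performs - using stand-in types and names invented for the example, not code from the patches. The idea: the filesystem bumps a sequence number every time an inode's extent map changes, ->iomap_begin() samples it into the cached mapping, and the actor revalidates before acting on that mapping:

#include <stdbool.h>
#include <stdint.h>

struct toy_inode_fork {
	uint32_t if_seq;		/* bumped on every extent map change */
};

struct toy_iomap {
	uint64_t offset;
	uint64_t length;
	uint32_t flags;			/* TOY_IOMAP_F_NEW, TOY_IOMAP_F_STALE, ... */
	uint32_t sampled_seq;		/* if_seq value when this mapping was built */
};

/* ->iomap_begin(): record the extent map generation alongside the mapping. */
static void toy_iomap_begin(struct toy_iomap *iomap, const struct toy_inode_fork *fork)
{
	iomap->sampled_seq = fork->if_seq;
}

/*
 * Before the actor initialises pages based on the cached mapping, check
 * that writeback completion (unwritten->written conversion, delalloc
 * allocation, etc) hasn't changed the extent map underneath us.
 */
static bool toy_iomap_valid(const struct toy_iomap *iomap, const struct toy_inode_fork *fork)
{
	return iomap->sampled_seq == fork->if_seq;
}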
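
On the remapping side, a similarly rough sketch (reusing the toy_iomap above; the flag names and iterator layout are again invented purely for illustration) of how an iterator might react when an actor marks the mapping stale, carrying the IOMAP_F_NEW state across to the remap instead of punching out the delalloc space:

#define TOY_IOMAP_F_NEW		(1u << 0)
#define TOY_IOMAP_F_STALE	(1u << 1)

struct toy_iter {
	struct toy_iomap iomap;
	uint32_t stale_flags;		/* flags carried over from a stale mapping */
};

static void toy_iter_handle_stale(struct toy_iter *iter)
{
	if (!(iter->iomap.flags & TOY_IOMAP_F_STALE))
		return;

	/*
	 * The delalloc space backing a stale IOMAP_F_NEW mapping can't be
	 * punched out safely (racing mmap() writes to the same page would
	 * be silently lost), so the remap must still be treated as newly
	 * allocated space and have sub-page/sub-block ranges zeroed.
	 * Carry the flag across to the next ->iomap_begin() call.
	 */
	iter->stale_flags = iter->iomap.flags & TOY_IOMAP_F_NEW;

	/* ...then loop back to ->iomap_begin() for the unprocessed range. */
}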