On Mon, Apr 22, 2024 at 02:39:21PM +0000, John Garry wrote:
> Add special handling of PG_atomic flag to iomap buffered write path.
>
> To flag an iomap iter for an atomic write, set IOMAP_ATOMIC.
>
> For a folio associated with a write which has IOMAP_ATOMIC set, set
> PG_atomic.
>
> Otherwise, when IOMAP_ATOMIC is unset, clear PG_atomic.
>
> This means that for an "atomic" folio which has not been written back, it
> loses it "atomicity". So if userspace issues a write with RWF_ATOMIC set
> and another write with RWF_ATOMIC unset and which fully or partially
> overwrites that same region as the first write, that folio is not written
> back atomically. For such a scenario to occur, it would be considered a
> userspace usage error.
>
> To ensure that a buffered atomic write is written back atomically when
> the write syscall returns, RWF_SYNC or similar needs to be used (in
> conjunction with RWF_ATOMIC).
>
> As a safety check, when getting a folio for an atomic write in
> iomap_get_folio(), ensure that the length matches the inode mapping folio
> order-limit.
>
> Only a single BIO should ever be submitted for an atomic write. So modify
> iomap_add_to_ioend() to ensure that we don't try to write back an atomic
> folio as part of a larger mixed-atomicity BIO.
>
> In iomap_alloc_ioend(), handle an atomic write by setting REQ_ATOMIC for
> the allocated BIO.
>
> When a folio is written back, again clear PG_atomic, as it is no longer
> required. I assume it will not be needlessly written back a second time...

I'm not taking a position on the mechanism yet; need to think about it
some more. But there's a hole here I also don't have a solution to, so
we can all start thinking about it.

In iomap_write_iter(), we call copy_folio_from_iter_atomic(). Through no
fault of the application, if the range crosses a page boundary, we might
partially copy the bytes from the first page, then take a page fault on
the second page, hence doing a short write into the folio. And there's
nothing preventing writeback from writing back a partially copied folio.

Now, if it's not dirty, then it can't be written back. So if we're
doing an atomic write, we could clear the dirty bit after calling
iomap_write_begin() (given the usage scenarios we've discussed, it
should always be clear ...)

We need to prevent the "fall back to a short copy" logic in
iomap_write_iter() as well. But then we also need to make sure we don't
get stuck in a loop, so maybe go three times around, and if it's still
not readable as a chunk, -EFAULT?
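
To make the "three times around" idea concrete, here is a rough userspace
model of it. This is not kernel code and not a proposed patch: struct
sim_user_buf, sim_copy_nofault(), sim_fault_in(), atomic_copy_bounded() and
ATOMIC_COPY_MAX_TRIES are all made-up names, the folio is a plain char
buffer, and the dirty bit is an int. It only shows the control flow: keep
the folio clean while the copy may be partial, never accept a short copy,
and give up with -EFAULT after a bounded number of attempts.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Assumption: the retry bound of three comes from the suggestion above. */
#define ATOMIC_COPY_MAX_TRIES 3

/* Hypothetical stand-in for a user buffer that may be partly unreadable. */
struct sim_user_buf {
	const char *data;
	size_t len;
	size_t readable;	/* bytes readable right now without faulting */
	int faultins_needed;	/* fault-in calls until fully readable; 0 = never */
};

/* Models a no-fault copy: copies what is readable, so it may be short. */
static size_t sim_copy_nofault(char *dst, const struct sim_user_buf *src)
{
	size_t n = src->readable < src->len ? src->readable : src->len;

	memcpy(dst, src->data, n);
	return n;
}

/* Models faulting in the user pages between attempts. */
static void sim_fault_in(struct sim_user_buf *src)
{
	if (src->faultins_needed > 0 && --src->faultins_needed == 0)
		src->readable = src->len;
}

/*
 * All-or-nothing copy with a bounded number of retries.
 * Returns 0 with *dirty set on a full copy, or -EFAULT with *dirty clear.
 * The buffer may hold partially copied bytes on failure, which is why it
 * must stay clean (not eligible for writeback) in that case.
 */
static int atomic_copy_bounded(char *folio, int *dirty, struct sim_user_buf *src)
{
	assert(folio && dirty && src);

	for (int try = 0; try < ATOMIC_COPY_MAX_TRIES; try++) {
		*dirty = 0;	/* a partial copy must never be written back */
		if (sim_copy_nofault(folio, src) == src->len) {
			*dirty = 1;	/* whole chunk landed; safe to dirty */
			return 0;
		}
		sim_fault_in(src);	/* short copy: fault in, then retry */
	}
	return -EFAULT;
}
```

In this model a buffer that becomes readable after one fault-in succeeds on
the second pass, while one that never becomes readable fails after the
third pass with the dirty flag still clear.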