On Wed, 28 Feb 2024 at 11:29, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
>
> The more concerning situation to me would be if breaking write
> atomicity means that we end up with data in the file that doesn't
> correspond to a total ordering of writes; e.g. part of write a, then
> write b, then the rest of write a overlaying part of write b.
>
> Maybe that can't happen as long as writes are always happening in
> ascending folio order?

So that was what my suggestion about just overlapping one lock at a
time was about - we always do writes in ascending order, and
contiguously (even if the data source obviously isn't necessarily some
contiguous thing).

And I think that's actually fundamental and not something we're likely
to get away from. If you do non-contiguous writes, you'll always treat
them as separate things.

Then just the "lock the next folio before unlocking the previous one"
rule would already give some relevant guarantees, in that at least you
wouldn't get overlapping writes where the write data gets mixed up
(there's a small sketch of that pattern at the end of this mail).

So you'd get *some* ordering, and while I wouldn't call it "total
ordering" (and I think that with readers not taking locks you can't
get that anyway, because readers will *always* be able to see partial
writes), I think it's much better than some re-write model.

However, the "lock consecutive pages as you go along" model does still
have the issue of "you have to be able to take a page fault in the
middle". And right now that actually depends on us always dropping the
page lock between iterations.

This issue is solvable - if you get a partial read while you hold a
page lock, you can always just check "are the pages I hold locks on
not up-to-date?" and decide to do the read yourself (and mark them
up-to-date). We don't do that right now because it's simpler not to,
but it's not conceptually a huge problem.

It *is* a practical problem, though. For example, in
generic_perform_write() we've left the page locking on writes to the
filesystems (ie it's done by ->write_begin/->write_end), so I think
that in reality "don't release the lock on folio N until after you've
taken the lock on folio N+1" isn't actually wonderful. It has some
really nasty practical issues.

And yes, those practical issues are due to historical mistakes (some
of them very much by yours truly). Having one single "page lock" was
probably one of those historical mistakes. If we used a different bit
for serializing page writers, the above problem would go away.

ANYWAY. At least with the current setup we literally depend on that
"one page at a time" behavior right now, and even XFS - which takes
the inode lock shared for reading - does *not* do it for reading a
page during a page fault, for all these reasons.

XFS uses iomap_write_iter() instead of generic_perform_write(), but
the solution there is exactly the same, and the issue is fairly
fundamental (unless you want to always read in the pages you are going
to overwrite first).

This is also one of the (many) reasons I think the POSIX atomicity
model is complete garbage. I think the whole "reads are atomic with
respect to writes" is a feel-good bedtime story. It's simplistic, and
it's just not *real*, because it's basically not compatible with mmap.

So the whole POSIX atomicity story comes from a historical
implementation background, and it ignores mmap. Fine, people can live
in that kind of "read and write without DIO is special" fairy tale,
and think that paper standards are more important than sane semantics.
But I just am not a fan of that.
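To make the mmap point concrete, here's a trivial userspace sketch
(mine, added for illustration - the file name and the 8-byte size are
entirely arbitrary): a store through a MAP_SHARED mapping hits the
page cache directly, one CPU store at a time, and there is simply no
lock that a concurrent read() could take against it:

/*
 * Illustrative sketch only: stores through a shared mapping are
 * plain memory writes, so a concurrent read() can observe any mix
 * of old and new bytes. File name and size are made up.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	char buf[9] = "";
	int fd = open("scratch", O_RDWR | O_CREAT | O_TRUNC, 0644);

	if (fd < 0 || write(fd, "AAAAAAAA", 8) != 8)
		return 1;

	char *map = mmap(NULL, 8, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;

	if (fork() == 0) {
		/* The "writer": plain CPU stores, one byte at a time. */
		for (int i = 0; i < 8; i++)
			map[i] = 'B';
		_exit(0);
	}

	/*
	 * The "reader": this read can race with the stores above and
	 * see a partial update - there is no atomicity to be had.
	 */
	pread(fd, buf, 8, 0);
	printf("%s\n", buf);
	return 0;
}

No amount of locking inside the read()/write() paths can fix that,
because the mmap side never asks for a lock in the first place.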
So I am personally perfectly happy to say "POSIX atomicity is a stupid
paper standard that has no relevance for reality". The read side
*cannot* be atomic wrt the write side.

But at the same time, I obviously then care a _lot_ about actual
existing loads. I may not worry about some POSIX atomicity guarantees,
but I *do* worry about real loads.

And I don't think real loads actually care about concurrent
overlapping writes at all, but the "I don't think" may be another
wishful feel-good bedtime story that isn't based on reality ;)

               Linus
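PS. The lock-coupling sketch promised above. This is obviously not the
real page lock - pthread mutexes stand in for it, and all the names
are made up - but it shows the "take lock N+1 before dropping lock N"
pattern, and why two ascending-order writers can't mix up their data
in an overlapping range:

/*
 * Userspace sketch of "lock folio N+1 before unlocking folio N".
 * All names are hypothetical; mutexes stand in for the page lock.
 */
#include <pthread.h>
#include <stdio.h>

#define NR_PAGES 8

struct fake_page {
	pthread_mutex_t lock;
	char data;
};

static struct fake_page pages[NR_PAGES];

/*
 * Write 'c' to pages [start, end), start < end: between the first
 * and last page we always hold at least one lock, so a second writer
 * on an overlapping range can never slip in between our pages.
 */
static void write_range(int start, int end, char c)
{
	pthread_mutex_lock(&pages[start].lock);
	for (int i = start; i < end; i++) {
		pages[i].data = c;
		if (i + 1 < end)	/* couple: take the next lock first */
			pthread_mutex_lock(&pages[i + 1].lock);
		pthread_mutex_unlock(&pages[i].lock);
	}
}

static void *writer(void *arg)
{
	write_range(2, NR_PAGES, *(char *)arg);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;
	char b = 'B', c = 'C';

	for (int i = 0; i < NR_PAGES; i++) {
		pthread_mutex_init(&pages[i].lock, NULL);
		pages[i].data = 'A';
	}

	/* Two overlapping writers, both in ascending page order. */
	pthread_create(&t1, NULL, writer, &b);
	pthread_create(&t2, NULL, writer, &c);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	/* Prints AABBBBBB or AACCCCCC - never BCBC-style mixing. */
	for (int i = 0; i < NR_PAGES; i++)
		putchar(pages[i].data);
	putchar('\n');
	return 0;
}

Because both writers acquire the per-page locks in ascending order,
the one that starts behind can never overtake the one in front inside
the overlap, so the overlap always ends up with one write's data in
its entirety. Same reason the thing is trivially deadlock-free.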