On Tue, 27 Feb 2024 at 14:21, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
>
> ext4 code doesn't do that. it takes the inode lock in exclusive mode,
> just like everyone else.

Not for dio, it doesn't.

> > The real question is how much of userspace will that break, because
> > of implicit assumptions that the kernel has always serialised
> > buffered writes?
>
> What would break?

Well, at least in theory you could have concurrent overlapping writes
of folio-crossing records, and currently you do get the guarantee that
one or the other record is written, but relying just on page locking
would mean that you might get a mix of them at page boundaries.

I'm not sure that such a model would make any sense, but if you
*intend* to break if somebody doesn't do write-to-write exclusion,
that's certainly possible.

The fact that we haven't given the atomicity guarantees wrt reads does
imply that nobody can do this kind of crazy thing, but it's an
implication, not a guarantee.

I really don't think such an odd load is sensible (except for the
special case of O_APPEND records, which definitely is sensible), and
it is certainly solvable.

For example, a purely "local lock" model would be to just lock all
pages in order as you write them, and not unlock the previous page
until you've locked the next one.

That is a really simple model that doesn't require any range locking
or anything like that because it simply relies on all writes always
being done strictly in file position order.

But you'd have to be very careful with deadlocks anyway in case there
are other cases of multi-page locks. And even without deadlocks, you
might end up having just a lot more lock contention (nested locks can
have *much* worse contention than sequential ones).

There are other models with multi-level locking, but I think we'd like
to try to keep things simple if we change something core like this.

             Linus
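
A minimal userspace sketch of the "local lock" model described above
(lock pages in file order, and do not unlock the previous page until
the next one is locked). This is a toy model and not kernel code: the
PagedFile class, the tiny PAGE_SIZE, and the demo() helper are all
hypothetical names made up for illustration, with one threading.Lock
standing in for each page lock. It only tries to show why two
overlapping writers cannot overtake each other, so the overlapping
range ends up wholly from one writer or the other.

```python
import threading

PAGE_SIZE = 4  # deliberately tiny so one record spans several pages


class PagedFile:
    """Toy stand-in for a page cache: one lock per fixed-size page."""

    def __init__(self, npages):
        self.pages = [bytearray(PAGE_SIZE) for _ in range(npages)]
        self.locks = [threading.Lock() for _ in range(npages)]

    def write(self, pos, data):
        """Write data at pos, locking pages hand-over-hand in file order.

        Assumes the write fits inside the preallocated pages.
        """
        if not data:
            return
        first = pos // PAGE_SIZE
        last = (pos + len(data) - 1) // PAGE_SIZE
        held = None  # index of the page lock we currently hold
        off = 0      # bytes of data copied so far
        for idx in range(first, last + 1):
            self.locks[idx].acquire()       # lock the next page first...
            if held is not None:
                self.locks[held].release()  # ...then drop the previous one
            held = idx
            start = pos + off - idx * PAGE_SIZE
            n = min(PAGE_SIZE - start, len(data) - off)
            self.pages[idx][start:start + n] = data[off:off + n]
            off += n
        self.locks[held].release()

    def read_all(self):
        """Return the whole file contents (no locking; call when idle)."""
        return b"".join(bytes(p) for p in self.pages)


def demo(rounds=200):
    """Race two overlapping multi-page writes; report whether the
    result was always exactly one record or the other."""
    rec_a = b"A" * 10
    rec_b = b"B" * 10
    for _ in range(rounds):
        f = PagedFile(4)
        threads = [threading.Thread(target=f.write, args=(1, rec))
                   for rec in (rec_a, rec_b)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        if f.read_all()[1:11] not in (rec_a, rec_b):
            return False
    return True
```

Because every writer takes locks strictly in increasing page order,
two writers cannot deadlock against each other in this sketch. It says
nothing about the other concerns raised above: other multi-page lock
users, and the extra contention that comes from holding two locks at
once.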