On Tue, Feb 27, 2024 at 02:46:11PM -0800, Linus Torvalds wrote:
> On Tue, 27 Feb 2024 at 14:21, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
> >
> > ext4 code doesn't do that. it takes the inode lock in exclusive mode,
> > just like everyone else.
>
> Not for dio, it doesn't.
>
> > > The real question is how much of userspace will that break, because
> > > of implicit assumptions that the kernel has always serialised
> > > buffered writes?
> >
> > What would break?
>
> Well, at least in theory you could have concurrent overlapping writes
> of folio crossing records, and currently you do get the guarantee that
> one or the other record is written, but relying just on page locking
> would mean that you might get a mix of them at page boundaries.
>
> I'm not sure that such a model would make any sense, but if you
> *intend* to break if somebody doesn't do write-to-write exclusion,
> that's certainly possible.
>
> The fact that we haven't given the atomicity guarantees wrt reads does
> imply that nobody can do this kinds of crazy thing, but it's an
> implication, not a guarantee.
>
> I really don't think such an odd load is sensible (except for the
> special case of O_APPEND records, which definitely is sensible), and
> it is certainly solvable.
>
> For example, a purely "local lock" model would be to just lock all
> pages in order as you write them, and not unlock the previous page
> until you've locked the next one.

The code I'm testing locks _all_ the folios we're writing to
simultaneously, and if they can't all be pinned and locked it just
falls back to the inode lock.

Which does raise the question of whether we've ever attempted to define
a lock ordering on folios. I suspect not, since the folio lock doesn't
even seem to have lockdep support.