On Wed, 28 Feb 2024 at 11:29, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
>
> The more concerning situation to me would be if breaking write
> atomicity means that we end up with data in the file that doesn't
> correspond to a total ordering of writes; e.g. part of write a, then
> write b, then the rest of write a overlaying part of write b.
>
> Maybe that can't happen as long as writes are always happening in
> ascending folio order?

So that was what my suggestion about just overlapping one lock at a
time was about - we always do writes in ascending order, and
contiguously (even if the data source obviously isn't necessarily some
contiguous thing).

And I think that's actually fundamental and not something we're likely
to get away from. If you do non-contiguous writes, you'll always treat
them as separate things.

Then just the "lock the next folio before unlocking the previous one"
rule would already give some relevant guarantees, in that at least you
wouldn't get overlapping writes where the write data gets mixed up
(there's a small sketch of that pattern at the end of this mail).

So you'd get *some* ordering, and while I wouldn't call it "total
ordering" (and I think that with readers not taking locks you can't
get that anyway, because readers will *always* be able to see partial
writes), I think it's much better than some re-write model.

However, the "lock consecutive pages as you go along" model does still
have the issue of "you have to be able to take a page fault in the
middle". And right now that actually depends on us always dropping the
page lock between iterations.

This issue is solvable - if you get a partial read while you hold a
page lock, you can always just check "are the pages I hold locks on
not up-to-date?" and decide to do the read yourself (and mark them
up-to-date). We don't do that right now because it's simpler not to,
but it's not conceptually a huge problem.

It *is* a practical problem, though. For example, in
generic_perform_write() we've left the page locking on writes to the
filesystems (ie it's done by ->write_begin/->write_end), so I think
that in reality "don't release the lock on folio N until after you've
taken the lock on folio N+1" isn't actually wonderful. It has some
really nasty practical issues.

And yes, those practical issues are due to historical mistakes (some
of them very much by yours truly). Having one single "page lock" was
probably one of those historical mistakes. If we used a different bit
for serializing page writers, the above problem would go away.

ANYWAY. At least with the current setup we literally depend on that
"one page at a time" behavior right now, and even XFS - which takes
the inode lock shared for reading - does *not* do it for reading a
page during a page fault, for all these reasons.

XFS uses iomap_write_iter() instead of generic_perform_write(), but
the solution there is exactly the same, and the issue is fairly
fundamental (unless you want to always read in the pages you are going
to overwrite first).

This is also one of the (many) reasons I think the POSIX atomicity
model is complete garbage. I think the whole "reads are atomic with
respect to writes" is a feel-good bedtime story. It's simplistic, and
it's just not *real*, because it's basically not compatible with mmap.

So the whole POSIX atomicity story comes from a historical
implementation background, and it ignores mmap. Fine, people can live
in that kind of "read and write without DIO is special" fairy tale,
and think that paper standards are more important than sane semantics.
But I just am not a fan of that.
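To make the mmap point concrete, here's a trivial userspace sketch
(mine, added for illustration - the file name and the 8-byte size are
entirely arbitrary): a store through a MAP_SHARED mapping hits the
page cache directly, one CPU store at a time, and there is simply no
lock that a concurrent read() could take against it:

/*
 * Illustrative sketch only: stores through a shared mapping are
 * plain memory writes, so a concurrent read() can observe any mix
 * of old and new bytes. File name and size are made up.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	char buf[9] = "";
	int fd = open("scratch", O_RDWR | O_CREAT | O_TRUNC, 0644);

	if (fd < 0 || write(fd, "AAAAAAAA", 8) != 8)
		return 1;

	char *map = mmap(NULL, 8, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;

	if (fork() == 0) {
		/* The "writer": plain CPU stores, one byte at a time. */
		for (int i = 0; i < 8; i++)
			map[i] = 'B';
		_exit(0);
	}

	/*
	 * The "reader": this read can race with the stores above and
	 * see a partial update - there is no atomicity to be had.
	 */
	pread(fd, buf, 8, 0);
	printf("%s\n", buf);
	return 0;
}

No amount of locking inside the read()/write() paths can fix that,
because the mmap side never asks for a lock in the first place.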
So I am personally perfectly happy to say "POSIX atomicity is a stupid
paper standard that has no relevance for reality". The read side
*cannot* be atomic wrt the write side.

But at the same time, I obviously then care a _lot_ about actual
existing loads. I may not worry about some POSIX atomicity guarantees,
but I *do* worry about real loads.

And I don't think real loads actually care about concurrent
overlapping writes at all, but the "I don't think" may be another
wishful feel-good bedtime story that isn't based on reality ;)

               Linus
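PS. The lock-coupling sketch promised above. This is obviously not the
real page lock - pthread mutexes stand in for it, and all the names
are made up - but it shows the "take lock N+1 before dropping lock N"
pattern, and why two ascending-order writers can't mix up their data
in an overlapping range:

/*
 * Userspace sketch of "lock folio N+1 before unlocking folio N".
 * All names are hypothetical; mutexes stand in for the page lock.
 */
#include <pthread.h>
#include <stdio.h>

#define NR_PAGES 8

struct fake_page {
	pthread_mutex_t lock;
	char data;
};

static struct fake_page pages[NR_PAGES];

/*
 * Write 'c' to pages [start, end), start < end: between the first
 * and last page we always hold at least one lock, so a second writer
 * on an overlapping range can never slip in between our pages.
 */
static void write_range(int start, int end, char c)
{
	pthread_mutex_lock(&pages[start].lock);
	for (int i = start; i < end; i++) {
		pages[i].data = c;
		if (i + 1 < end)	/* couple: take the next lock first */
			pthread_mutex_lock(&pages[i + 1].lock);
		pthread_mutex_unlock(&pages[i].lock);
	}
}

static void *writer(void *arg)
{
	write_range(2, NR_PAGES, *(char *)arg);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;
	char b = 'B', c = 'C';

	for (int i = 0; i < NR_PAGES; i++) {
		pthread_mutex_init(&pages[i].lock, NULL);
		pages[i].data = 'A';
	}

	/* Two overlapping writers, both in ascending page order. */
	pthread_create(&t1, NULL, writer, &b);
	pthread_create(&t2, NULL, writer, &c);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	/* Prints AABBBBBB or AACCCCCC - never BCBC-style mixing. */
	for (int i = 0; i < NR_PAGES; i++)
		putchar(pages[i].data);
	putchar('\n');
	return 0;
}

Because both writers acquire the per-page locks in ascending order,
the one that starts behind can never overtake the one in front inside
the overlap, so the overlap always ends up with one write's data in
its entirety. Same reason the thing is trivially deadlock-free.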