Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Wed, 28 Feb 2024 18:21:37 -0500

On Wed, Feb 28, 2024 at 12:17:50PM -0800, Linus Torvalds wrote:
> On Wed, 28 Feb 2024 at 11:29, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
> >
> > The more concerning sitution to me would be if breaking write atomicity
> > means that we end up with data in the file that doesn't correspond to an
> > total ordering of writes; e.g. part of write a, then write b, then the
> > rest of write a overlaying part of write b.
> >
> > Maybe that can't happen as long as writes are always happening in
> > ascending folio order?
> 
> So that was what my suggestion about just overlapping one lock at a
> time was about - we always do writes in ascending order, and
> contiguously (even if the data source obviously isn't necessarily some
> contiguous thing).
> 
> And I think that's actually fundamental and not something we're likely
> to get out of. If you do non-contiguous writes, you'll always treat
> them as separate things.
> 
> Then just the "lock the next folio before unlocking the previous one"
> would already give some relevant guarantees, in that at least you
> wouldn't get overlapping writes where the write data would be mixed
> up.
> 
> So you'd get *some* ordering, and while I wouldn't call it "total
> ordering" (and I think with readers not taking locks you can't get
> that anyway because readers will *always* see partial writes), I think
> it's much better than some re-write model.

Well, the re-write model is for the case where we have to bail out, drop
locks and re-fault, and that's no different here. But that's a side
issue.

The "just take the next lock before dropping the previous lock" approach
- I agree that works just as well. I'm still trying to come up with a
satisfying proof as to why - but one writer can't pass another writer,
it's basically that.

> 
> However, the "lock consecutive pages as you go along" does still have
> the issue of "you have to be able to take a page fault in the middle".
> 
> And right now that actually depends on us always dropping the page
> lock in between iterations.
> 
> This issue is solvable - if you get a partial read while you hold a
> page lock, you can always just see "are the pages I hold locks on not
> up-to-date?" And decide to do the read yourself (and mark them
> up-to-date). We don't do that right now because it's simpler not to,
> but it's not conceptually a huge problem.

Hang on, I think you're talking about two different issues. The "oops,
copying from the _user_ buffer faulted" is the more annoying one here.
If the destination folio wasn't uptodate, we just read that inline, we
don't drop locks for that (at least my code doesn't; I'd have to have a
look at filemap.c and iomap for those versions).

There is another funny corner case where the destination folio wasn't
uptodate, and we decided we didn't need to read it in because we were
going to be fully overwriting it, but then we _didn't_ fully overwrite
it - then we have to revert that part of the write.

But if that happened it was because the useer copy didn't copy as much
as we expected, i.e. it would've needed to take a page fault, so we have
to bail out, drop locks and refault anyways.

As an aside, the "copy_to_(from|user) may recurse into arbitrary code
and take arbitrary locks" is some scary stuff, and I very much doubt
it's handled correctly everywhere. If I ever do the ioctl v2 proposal I
wrote up awhile back, I think part of that should be doubble buffering
approximately at the syscall layer so that we're not doing that from
within rando driver/fs code holding rando locks; the 1% of cases that
care about performance can do something special, but most shouldn't.

> It *is* a practical problem, though.
> 
> For example, in generic_perform_write(), we've left page locking on
> writes to the filesystems (ie it's done by
> "->write_begin/->write_end"), so I think in reality that "don't
> release the lock on folio N until after you've taken the lock on folio
> N+1" isn't actually wonderful. It has some really nasty practical
> issues.

yeah, write_begin and write_end are... odd, to say the least.

> And yes, those practical issues are because of historical mistakes
> (some of them very much by yours truly). Having one single "page lock"
> was probably one of those historical mistakes. If we use a different
> bit for serializing page writers, the above problem would go away as
> an issue.

well, it's a difficult situation, a _lot_ of different code ends up
wanting to communicate state machine style through page flags.

but yeah, we're not clear on whether folio lock protects
 - the data within the folio?
 - is it used for signalling read completion? (sometimes!)
 - is it used for guarding against against truncate/invalidate? There's
   multiple different truncate/invalidate paths, for doing different
   things based on folio dirty/writeback and perhaps locked
 - does it protect other filesystem state hanging off folio->private?
   you might thing so, but I came to the conclusion that that was a bad
   idea

> ANYWAY.
> 
> At least with the current setup we literally depend on that "one page
> at a time" behavior right now, and even XFS - which takes the inode
> lock shared for reading - does *not* do it for reading a page during a
> page fault for all these reasons.
> 
> XFS uses iomap_write_iter() instead of generic_perform_write(), but
> the solution there is exactly the same, and the issue is fairly
> fundamental (unless you want to always read in pages that you are
> going to overwrite first).
> 
> This is also one of the (many) reasons I think the POSIX atomicity
> model is complete garbage. I think the whole "reads are atomic with
> respect to writes" is a feel-good bedtime story. It's simplistic, and
> it's just not *real*, because it's basically not compatible with mmap.

oh, I wouldn't go that far.

I think the real problem is just undocumented behaviour that differs
across filesystems, and we've got too many filesystems doing their own
thing where their smart stuff should've been lifted up to the vfs (and
yes, I am a guilty party here too).

And another part of what makes this stuff hard is that it's always such
a goddamn hassle to extend userspace interfaces, even just to add a
couple flags. People have a real fear of that stuff, and when things are
proposed they end up getting bikeshed to death, and then we _overdesign_
stuff so that we don't have to repeat that process... no. Make it easy
to add new userspace interfaces, do it in small bite sized chunks, one
thing at a time, learn from our mistakes, depracate and add new
revisions for the stuff that turned out to be crap.

...What we really want here is just some flags so that userspace can
specify what locking model they want. We're clearly capable of
implementing multiple locking models, and the real nightmare scenario is
some userspace program depending on stupidly subtle atomicity guarantees
that only one filesystem implements AND THERE'S NOTHING TO GREP FOR IN
THE CODE TO FIND IT.