Re: [PATCH 0/7] Move prefaulting into write slow paths

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Thu, 30 Jan 2025 19:56:26 -0500

On Thu, Jan 30, 2025 at 08:04:49AM -0800, Dave Hansen wrote:
> On 1/29/25 23:44, Kent Overstreet wrote:
> > On Wed, Jan 29, 2025 at 10:17:49AM -0800, Dave Hansen wrote:
> >> tl;dr: The VFS and several filesystems have some suspect prefaulting
> >> code. It is unnecessarily slow for the common case where a write's
> >> source buffer is resident and does not need to be faulted in.
> >>
> >> Move these "prefaulting" operations to slow paths where they ensure
> >> forward progress but they do not slow down the fast paths. This
> >> optimizes the fast path to touch userspace once instead of twice.
> >>
> >> Also update somewhat dubious comments about the need for prefaulting.
> >>
> >> This has been very lightly tested. I have not tested any of the fs/
> >> code explicitly.
> > 
> > Q: what is preventing us from posting code to the list that's been
> > properly tested?
> > 
> > I just got another bcachefs patch series that blew up immediately when I
> > threw it at my CI.
> > 
> > This is getting _utterly ridiculous_.
> 
> In this case, I started with a single patch for generic code that I knew
> I could test. In fact, I even had the 9-year-old binary sitting on my
> test box.
> 
> Dave Chinner suggested that I take the generic pattern go look a _bit_
> more widely in the tree for a similar pattern. That search paid off, I
> think. But I ended up touching corners of the tree I don't know well and
> don't have test cases for.

That's all well and good, but the testing thing is really coming to a
head. I have enough on my plate without back-and-forth 

> > I built multiuser test infrastructure with a nice dashboard that anyone
> > can use, and the only response I've gotten from the old guard is Ted
> > jumping in every time I talk about it to say "no, we just don't want to
> > rewrite our stuff on _your_ stuff!". Real helpful, that.
> 
> Sounds pretty cool! Is this something that I could have and should have
> used to test the bcachefs patch?  I see some trees in here:
> 
> 	https://evilpiepirate.org/~testdashboard/ci
> 
> But I'm not sure how to submit patches to it. Do you need to add users
> manually? I wonder, though, how we could make it easier to find. I
> didn't see anything Documentation/filesystems/bcachefs/ about this.

Yes, I give out user accounts and that gives you a config file you can
edit to specify branches to test and tests to run; it then automatically
watches those branch(es).

Here's the thing though, the servers cost real money. I give out
accounts to community members (and people working on bcachefs are using
it and it's working wel), but if you've got a big tech company email
address you (or your managers) will have to be pitching in. Not
subsidizing you guys :)

> 
> >>  1. Deadlock avoidance if the source and target are the same
> >>     folios.
> >>  2. To check the user address that copy_folio_from_iter_atomic()
> >>     will touch because atomic user copies do not check the address.
> >>  3. "Optimization"
> >>
> >> I'm not sure any of these are actually valid reasons.
> >>
> >> The "atomic" user copy functions disable page fault handling because
> >> page faults are not very atomic. This makes them naturally resistant
> >> to deadlocking in page fault handling. They take the page fault
> >> itself but short-circuit any handling.
> > 
> > #1 is emphatically valid: the deadlock avoidance is in _both_ using
> > _atomic when we have locks held, and doing the actual faulting with
> > locks dropped... either alone would be a buggy incomplete solution.
> 
> I was (badly) attempting to separate out the two different problems:
> 
> 	1. Doing lock_page() twice, which I was mostly calling the
> 	   "deadlock"
> 	2. Retrying the copy_folio_from_iter_atomic() forever which I
> 	   was calling the "livelock"
> 
> Disabling page faults fixes #1.
> Doing faulting outside the locks somewhere fixes #2.
> 
> So when I was talking about "Deadlock avoidance" in the cover letter, I
> was trying to focus on the double lock_page() problem.
> 
> > This needs to be reflected and fully described in the comments, since
> > it's subtle and a lot of people don't fully grok what's going on.
> 
> Any suggestions for fully describing the situation? I tried to sprinkle
> comments liberally but I'm also painfully aware that I'm not doing a
> perfect job of talking about the fs code.

The critical thing to cover is the fact that mmap means that page faults
can recurse into arbitrary filesystem code, thus blowing a hole in all
our carefully crafted lock ordering if we allow that while holding
locks - you didn't mention that at all.

> > I'm fairly certain we have ioctl code where this is mishandled and thus
> > buggy, because it takes some fairly particular testing for lockdep to
> > spot it.
> 
> Yeah, I wouldn't be surprised. It was having a little chuckle thinking
> about how many engineers have discovered and fixed this problem
> independently over the years in all the file system code in all the OSes.

Oh, this is the easy one - mmap and dio is where it really gets into
"stories we tell young engineers to keep them awake at night".

> 
> >> copy_folio_from_iter_atomic() also *does* have user address checking.
> >> I get a little lost in the iov_iter code, but it does know when it's
> >> dealing with userspace versus kernel addresses and does seem to know
> >> when to do things like copy_from_user_iter() (which does access_ok())
> >> versus memcpy_from_iter().[1]
> >>
> >> The "optimization" is for the case where 'source' is not faulted in.
> >> It can avoid the cost of a "failed" page fault (it will fail to be
> >> handled because of the atomic copy) and then needing to drop locks and
> >> repeat the fault.
> > 
> > I do agree on moving it to the slowpath - I think we can expect the case
> > where the process's immediate workingset is faulted out while it's
> > running to be vanishingly small.
> 
> Great! I'm glad we're on the same page there.
> 
> For bcachefs specifically, how should we move forward? If you're happy
> with the concept, would you prefer that I do some manual bcachefs
> testing? Or leave a branch sitting there for a week and pray the robots
> test it?

No to the sit and pray. If I see one more "testing? that's something
other people do" conversation I'll blow another gasket.

xfstests supports bcachefs, and if you need a really easy way to run it
locally on all the various filesystems, I have a solution for that:

https://evilpiepirate.org/git/ktest.git/

If you want access to my CI that runs all that in parallel across 120
VMs with the nice dashboard - shoot me an email and I'll outline server
costs and we can work something out.