Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Wed, 9 Dec 2020 11:04:13 -0800

On Wed, Dec 9, 2020 at 10:40 AM Will Deacon <will@xxxxxxxxxx> wrote:
>
> > And yes, that probably means that you need to change "alloc_set_pte()"
> > to actually pass in the real address, and leave "vmf->address" alone -
> > so that it can know which ones are prefaulted and which one is real,
> > but that sounds like a good idea anyway.
>
> Right, I deliberately avoided that based on the feedback from Jan on an
> older version [1], but I can certainly look at it again.

Side note: I absolutely detest alloc_set_pte() in general, and it
would be very good to try to just change the rules of that function
entirely.

The alloc_set_pte() function is written to be some generic "iterate
over ptes setting them'" function, but it has exactly two users, and
none of those users really want the semantics it gives them, I feel.
And the semantics that alloc_set_pte() does have actually *hurts*
them.

In particular, it made it a nightmare to read what do_fault_around()
does: it does that odd

        if (pmd_none(*vmf->pmd)) {
                vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);

and then it calls ->map_pages() (which is always filemap_map_pages(),
except for xfs, where it is also always filemap_map_pages but it takes
a lock first).

And it did that prealloc_pte, because that's what alloc_set_pte()
needs to be atomic - and it needs to be atomic because it's all done
under the RCU lock.

So essentially, the _major_ user of alloc_set_pte() requires atomicity
- but that's not at all obvious when you actually read alloc_set_pte()
itself, and in fact when you read it, you see that

        if (!vmf->pte) {
                ret = pte_alloc_one_map(vmf);

which can block. Except it won't block if we have vmf->prealloc_pte,
because then it will just take that instead.

In other words, ALL THAT IS COMPLETE GARBAGE. And it's written to be
as obtuse as humanly possible, and there is no way in hell anybody
understands it by just looking at the code - you have to follow all
these odd quirks back to see what's going on.

So it would be much better to get rid of alloc_set_pte() entirely, and
move the allocation and the pte locking into the caller, and just
clean it up (because it turns out the callers have their own models
for allocation anyway). And then each of the callers would be a lot
more understandable, instead of having this insanely subtle "I need to
be atomic, so I will pre-allocate in one place, and then take the RCU
lock in another place, in order to call a _third_ place that is only
atomic because of that first step".

The other user of alloc_set_pte() is "finish_fault()", and honestly,
that's what alloc_set_pte() was written for, and why alloc_set_pte()
does all those things that filemap_map_pages() doesn't even want.

But while that other user does want what alloc_set_pte() does, there's
no downside to just moving those things into the caller. It would once
again just clarify things to make the allocation and the setting be
separate operations - even in that place where it doesn't have the
same kind of very subtle behavior with pre-allocation and atomicity.

I think the problem with alloc_set_pte() is hat it has had many
different people working on it over the years, and Kirill massaged
that old use to fit the new pre-alloc use. Before the pre-faulting, I
think both cases were much closer to each other.

So I'm not really blaming anybody here, but the ugly and very
hard-to-follow rules seem to come from historical behavior that was
massaged over time to "just work" rather than have that function be
made more logical.

Kirill - I think you know this code best. Would you be willing to look
at splitting out that "allocate and map" from alloc_set_pte() and into
the callers?

At that point, I think the current very special and odd
do_fault_around() pre-allocation could be made into just a _regular_
"allocate the pmd if it doesn't exist". And then the pte locking could
be moved into filemap_map_pages(), and suddenly the semantics and
rules around all that would be a whole lot more obvious.

Pretty please?

              Linus