On Wed, Dec 9, 2020 at 10:40 AM Will Deacon <will@xxxxxxxxxx> wrote: > > > And yes, that probably means that you need to change "alloc_set_pte()" > > to actually pass in the real address, and leave "vmf->address" alone - > > so that it can know which ones are prefaulted and which one is real, > > but that sounds like a good idea anyway. > > Right, I deliberately avoided that based on the feedback from Jan on an > older version [1], but I can certainly look at it again. Side note: I absolutely detest alloc_set_pte() in general, and it would be very good to try to just change the rules of that function entirely. The alloc_set_pte() function is written to be some generic "iterate over ptes setting them'" function, but it has exactly two users, and none of those users really want the semantics it gives them, I feel. And the semantics that alloc_set_pte() does have actually *hurts* them. In particular, it made it a nightmare to read what do_fault_around() does: it does that odd if (pmd_none(*vmf->pmd)) { vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm); and then it calls ->map_pages() (which is always filemap_map_pages(), except for xfs, where it is also always filemap_map_pages but it takes a lock first). And it did that prealloc_pte, because that's what alloc_set_pte() needs to be atomic - and it needs to be atomic because it's all done under the RCU lock. So essentially, the _major_ user of alloc_set_pte() requires atomicity - but that's not at all obvious when you actually read alloc_set_pte() itself, and in fact when you read it, you see that if (!vmf->pte) { ret = pte_alloc_one_map(vmf); which can block. Except it won't block if we have vmf->prealloc_pte, because then it will just take that instead. In other words, ALL THAT IS COMPLETE GARBAGE. And it's written to be as obtuse as humanly possible, and there is no way in hell anybody understands it by just looking at the code - you have to follow all these odd quirks back to see what's going on. So it would be much better to get rid of alloc_set_pte() entirely, and move the allocation and the pte locking into the caller, and just clean it up (because it turns out the callers have their own models for allocation anyway). And then each of the callers would be a lot more understandable, instead of having this insanely subtle "I need to be atomic, so I will pre-allocate in one place, and then take the RCU lock in another place, in order to call a _third_ place that is only atomic because of that first step". The other user of alloc_set_pte() is "finish_fault()", and honestly, that's what alloc_set_pte() was written for, and why alloc_set_pte() does all those things that filemap_map_pages() doesn't even want. But while that other user does want what alloc_set_pte() does, there's no downside to just moving those things into the caller. It would once again just clarify things to make the allocation and the setting be separate operations - even in that place where it doesn't have the same kind of very subtle behavior with pre-allocation and atomicity. I think the problem with alloc_set_pte() is hat it has had many different people working on it over the years, and Kirill massaged that old use to fit the new pre-alloc use. Before the pre-faulting, I think both cases were much closer to each other. So I'm not really blaming anybody here, but the ugly and very hard-to-follow rules seem to come from historical behavior that was massaged over time to "just work" rather than have that function be made more logical. Kirill - I think you know this code best. Would you be willing to look at splitting out that "allocate and map" from alloc_set_pte() and into the callers? At that point, I think the current very special and odd do_fault_around() pre-allocation could be made into just a _regular_ "allocate the pmd if it doesn't exist". And then the pte locking could be moved into filemap_map_pages(), and suddenly the semantics and rules around all that would be a whole lot more obvious. Pretty please? Linus