On Tue, 22 Aug 2023, Jann Horn wrote: > On Tue, Aug 22, 2023 at 4:51 AM Hugh Dickins <hughd@xxxxxxxxxx> wrote: > > On Mon, 21 Aug 2023, Jann Horn wrote: > > > On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins <hughd@xxxxxxxxxx> wrote: > > > > Just for this case, take the pmd_lock() two steps earlier: not because > > > > it gives any protection against this case itself, but because ptlock > > > > nests inside it, and it's the dropping of ptlock which let the bug in. > > > > In other cases, continue to minimize the pmd_lock() hold time. > > > > > > Special-casing userfaultfd like this makes me a bit uncomfortable; but > > > I also can't find anything other than userfaultfd that would insert > > > pages into regions that are khugepaged-compatible, so I guess this > > > works? > > > > I'm as sure as I can be that it's solely because userfaultfd breaks > > the usual rules here (and in fairness, IIRC Andrea did ask my permission > > before making it behave that way on shmem, COWing without a source page). > > > > Perhaps something else will want that same behaviour in future (it's > > tempting, but difficult to guarantee correctness); for now, it is just > > userfaultfd (but by saying "_armed" rather than "_missing", I'm half- > > expecting uffd to add more such exceptional modes in future). > > Hm, yeah, sounds okay. (I guess we'd also run into this if we ever > wanted to make it possible to reliably install PTE markers with > madvise() or something like that, which might be nice for allowing > userspace to create guard pages without unnecessary extra VMAs...) I see the mailthread has taken inspiration from your comment there, and veered off in that direction: but I'll ignore those futures. > > > > I guess an alternative would be to use a spin_trylock() instead of the > > > current pmd_lock(), and if that fails, temporarily drop the page table > > > lock and then restart from step 2 with both locks held - and at that > > > point the page table scan should be fast since we expect it to usually > > > be empty. > > > > That's certainly a good idea, if collapse on userfaultfd_armed private > > is anything of a common case (I doubt, but I don't know). It may be a > > better idea anyway (saving a drop and retake of ptlock). > > I was thinking it also has the advantage that it would still perform > okay if we got rid of the userfaultfd_armed() condition at some point > - though I realize that designing too much for hypothetical future > features is an antipattern. > > > I gave it a try, expecting to end up with something that would lead > > me to say "I tried it, but it didn't work out well"; but actually it > > looks okay to me. I wouldn't say I prefer it, but it seems reasonable, > > and no more complicated (as Peter rightly observes) than the original. > > > > It's up to you and Peter, and whoever has strong feelings about it, > > to choose between them: I don't mind (but I shall be sad if someone > > demands that I indent that comment deeper - I'm not a fan of long > > multi-line comments near column 80). > > I prefer this version because it would make it easier to remove the > "userfaultfd_armed()" check in the future if we have to, but I guess > we could also always change it later if that becomes necessary, so I > don't really have strong feelings on it at this point. Thanks for considering them both, Jann. I do think your trylock way, as in v2, is in principle superior, and we may well have good reason to switch over to it in future; but I find it slightly more confusing, so will follow your and Peter's "no strong feelings" for now, and ask Andrew please to take the original (implicit v1). Overriding reason: I realized overnight that v2 is not quite correct: I was clever enough to realize that nr_ptes needed to be reset to 0 to get the accounting right with a recheck pass, but not clever enough to realize that resetting it to 0 there would likely skip the abort path's flush_tlb_mm(mm), when we actually had cleared entries on the first pass. It needs a separate bool to decide the flush_tlb_mm(mm), or it needs that (ridiculously minor!) step 3 to be moved down. But rather than reworking it, please let's just go with v1 for now. Thanks, Hugh