On Thu, Aug 08, 2024, Peter Xu wrote:
> Hi, Sean,
>
> On Thu, Aug 08, 2024 at 08:33:59AM -0700, Sean Christopherson wrote:
> > On Wed, Aug 07, 2024, Peter Xu wrote:
> > > mprotect() does mmu notifiers in PMD levels. It's there since 2014 of
> > > commit a5338093bfb4 ("mm: move mmu notifier call from change_protection to
> > > change_pmd_range").
> > >
> > > At that time, the issue was that NUMA balancing can be applied on a huge
> > > range of VM memory, even if nothing was populated. The notification can be
> > > avoided in this case if no valid pmd detected, which includes either THP or
> > > a PTE pgtable page.
> > >
> > > Now to pave way for PUD handling, this isn't enough. We need to generate
> > > mmu notifications even on PUD entries properly. mprotect() is currently
> > > broken on PUD (e.g., one can easily trigger kernel error with dax 1G
> > > mappings already), this is the start to fix it.
> > >
> > > To fix that, this patch proposes to push such notifications to the PUD
> > > layers.
> > >
> > > There is risk on regressing the problem Rik wanted to resolve before, but I
> > > think it shouldn't really happen, and I still chose this solution because
> > > of a few reasons:
> > >
> > > 1) Consider a large VM that should definitely contain more than GBs of
> > > memory, it's highly likely that PUDs are also none. In this case there
> >
> > I don't follow this. Did you mean to say it's highly likely that PUDs are *NOT*
> > none?
>
> I did mean the original wordings.
>
> Note that in the previous case Rik worked on, it's about a mostly empty VM
> got NUMA hint applied. So I did mean "PUDs are also none" here, with the
> hope that when the numa hint applies on any part of the unpopulated guest
> memory, it'll find nothing in PUDs. Here it's mostly not about a huge PUD
> mapping as long as the guest memory is not backed by DAX (since only DAX
> supports 1G huge pud so far, while hugetlb has its own path here in
> mprotect, so it must be things like anon or shmem), but a PUD entry that
> contains pmd pgtables. For that part, I was trying to justify "no pmd
> pgtable installed" with the fact that "a large VM that should definitely
> contain more than GBs of memory", it means the PUD range should hopefully
> never been accessed, so even the pmd pgtable entry should be missing.

Ah, now I get what you were saying.

Problem is, walking the rmaps for the shadow MMU doesn't benefit (much) from
empty PUDs, because KVM needs to blindly walk the rmaps for every gfn covered by
the PUD to see if there are any SPTEs in any shadow MMUs mapping that gfn. And
that walk is done without ever yielding, which I suspect is the source of the
soft lockups of yore.

And there's no way around that conundrum (walking rmaps), at least not without a
major rewrite in KVM. In a nested TDP scenario, KVM's stage-2 page tables (for
L2) key off of L2 gfns, not L1 gfns, and so the only way to find mappings is
through the rmaps.
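
To put a rough number on the cost, here is a minimal, purely illustrative
sketch. It is not KVM code: the names (sketch_gfns_per_pud, sketch_walk_rmaps),
the flat one-byte-per-gfn array, and the 4 KiB page / 1 GiB PUD sizes are all
assumptions made for the example. The point is only that an rmap-style walk has
to visit every gfn in the range, so an entirely empty range costs as many
iterations as a fully populated one.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed: 4 KiB base pages */
#define SKETCH_PUD_SHIFT  30	/* assumed: 1 GiB covered by one PUD entry */

/* Number of gfns covered by one PUD-sized range under the assumptions above. */
static inline uint64_t sketch_gfns_per_pud(void)
{
	return 1ULL << (SKETCH_PUD_SHIFT - SKETCH_PAGE_SHIFT);
}

/*
 * Hypothetical stand-in for an rmap-based invalidation.  rmap_head[i] is
 * non-zero when gfn (base + i) has at least one shadow PTE mapping it.
 * There is no higher-level "this whole range is empty" shortcut, so every
 * slot is inspected.  Returns the number of slots visited and stores the
 * number of non-empty slots in *zapped.
 */
static uint64_t sketch_walk_rmaps(const uint8_t *rmap_head, uint64_t nr_gfns,
				  uint64_t *zapped)
{
	uint64_t visited = 0;

	*zapped = 0;
	for (uint64_t i = 0; i < nr_gfns; i++) {
		visited++;		/* cost is paid even for empty slots */
		if (rmap_head[i])
			(*zapped)++;
	}
	return visited;
}
```

Under these assumptions sketch_gfns_per_pud() is 262144, i.e. a single PUD-level
notification translates into that many rmap lookups even when nothing is mapped.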