On Tue, Apr 19, 2022 at 11:36 AM Oliver Upton <oupton@xxxxxxxxxx> wrote:
>
> On Tue, Apr 19, 2022 at 10:57 AM Ben Gardon <bgardon@xxxxxxxxxx> wrote:
> >
> > On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@xxxxxxxxxx> wrote:
> > >
> > > Presently KVM only takes a read lock for stage 2 faults if it believes
> > > the fault can be fixed by relaxing permissions on a PTE (write unprotect
> > > for dirty logging). Otherwise, stage 2 faults grab the write lock, which
> > > predictably can pile up all the vCPUs in a sufficiently large VM.
> > >
> > > The x86 port of KVM has what it calls the TDP MMU. Basically, it is an
> > > MMU protected by the combination of a read-write lock and RCU, allowing
> > > page walkers to traverse in parallel.
> > >
> > > This series is strongly inspired by the mechanics of the TDP MMU,
> > > making use of RCU to protect parallel walks. Note that the TLB
> > > invalidation mechanics are a bit different between x86 and ARM, so we
> > > need to use the 'break-before-make' sequence to split/collapse a
> > > block/table mapping, respectively.
> > >
> > > Nonetheless, using atomics on the break side allows fault handlers to
> > > acquire exclusive access to a PTE (let's just call it 'locked'). Once the
> > > PTE lock is acquired it is then safe to assume exclusive access.
> > >
> > > Special consideration is required when pruning the page tables in
> > > parallel. Suppose we are collapsing a table into a block. Allowing
> > > parallel faults means that a software walker could be in the middle of
> > > a lower level traversal when the table is unlinked. Table
> > > walkers that prune the paging structures must now 'lock' all descendant
> > > PTEs, effectively asserting exclusive ownership of the substructure
> > > (no other walker can install something to an already locked PTE).
> > >
> > > Additionally, for parallel walks we need to punt the freeing of table
> > > pages to the next RCU sync, as there could be multiple observers of the
> > > table until all walkers exit the RCU critical section. For this I
> > > decided to cram an rcu_head into page private data for every table page.
> > > We wind up spending a bit more on table pages now, but lazily allocating
> > > for rcu callbacks probably doesn't make a lot of sense. Not only would
> > > we need a large cache of them (think about installing a level 1 block)
> > > to wire up callbacks on all descendant tables, but we also then need to
> > > spend memory to actually free memory.
> >
> > FWIW we used a similar approach in early versions of the TDP MMU, but
> > instead of page->private used page->lru so that more metadata could be
> > stored in page->private.
> > Ultimately that ended up being too limiting and we decided to switch
> > to just using the associated struct kvm_mmu_page as the list element.
> > I don't know if ARM has an equivalent construct though.
>
> ARM currently doesn't have any metadata it needs to tie with the table
> pages. We just do very basic page reference counting for every valid
> PTE. I was going to link together pages (hence the page header), but
> we actually do not have a functional need for it at the moment. In
> fact, struct page::rcu_head would probably fit the bill and we can
> avoid extra metadata/memory for the time being.

Ah, right! page::rcu_head was the field I was thinking of.

>
> Perhaps best to keep it simple and do the rest when we have a genuine
> need for it.

Completely agree.
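
To make that concrete, deferring the free of an unlinked table page
through struct page::rcu_head could look roughly like the sketch below.
The helper names are made up for illustration; they are not the ones the
series actually defines:

static void stage2_free_unlinked_table_rcu_cb(struct rcu_head *head)
{
        /* rcu_head shares space with other fields in struct page */
        struct page *page = container_of(head, struct page, rcu_head);

        __free_page(page);
}

/* Called once the table page has been unlinked via break-before-make */
static void stage2_defer_free_unlinked_table(struct page *page)
{
        call_rcu(&page->rcu_head, stage2_free_unlinked_table_rcu_cb);
}

Walkers still inside their RCU read-side critical section can keep
dereferencing the old table, and the page only goes back to the
allocator after the grace period, with no extra per-page metadata.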

I'm surprised that ARM doesn't have a need for a metadata structure
associated with each page of the stage 2 paging structure, but if you
don't need it, that definitely makes things simpler.

>
> --
> Thanks,
> Oliver
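
(As an aside for anyone following along: the 'locked' PTE idea from the
cover letter boils down to winning an atomic compare-and-exchange on the
entry. A rough sketch, using an invented bit and helper name rather than
whatever encoding the series settles on:

#define STAGE2_PTE_LOCKED       BIT(10) /* hypothetical reserved, invalid encoding */

/*
 * The walker that wins the cmpxchg owns the entry and may carry out
 * break-before-make; a loser backs off and retries its walk, and no
 * walker can install anything over a locked entry.
 */
static bool stage2_try_lock_pte(kvm_pte_t *ptep, kvm_pte_t old)
{
        return cmpxchg(ptep, old, STAGE2_PTE_LOCKED) == old;
}

Collapsing a table into a block then amounts to locking every descendant
PTE this way before unlinking and freeing the table pages.)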