On Tue, Apr 19, 2022 at 10:57 AM Ben Gardon <bgardon@xxxxxxxxxx> wrote:
>
> On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@xxxxxxxxxx> wrote:
> >
> > Presently KVM only takes a read lock for stage 2 faults if it believes
> > the fault can be fixed by relaxing permissions on a PTE (write unprotect
> > for dirty logging). Otherwise, stage 2 faults grab the write lock, which
> > predictably can pile up all the vCPUs in a sufficiently large VM.
> >
> > The x86 port of KVM has what it calls the TDP MMU. Basically, it is an
> > MMU protected by the combination of a read-write lock and RCU, allowing
> > page walkers to traverse in parallel.
> >
> > This series is strongly inspired by the mechanics of the TDP MMU,
> > making use of RCU to protect parallel walks. Note that the TLB
> > invalidation mechanics are a bit different between x86 and ARM, so we
> > need to use the 'break-before-make' sequence to split/collapse a
> > block/table mapping, respectively.
> >
> > Nonetheless, using atomics on the break side allows fault handlers to
> > acquire exclusive access to a PTE (let's just call it locked). Once the
> > PTE lock is acquired it is then safe to assume exclusive access.
> >
> > Special consideration is required when pruning the page tables in
> > parallel. Suppose we are collapsing a table into a block. Allowing
> > parallel faults means that a software walker could be in the middle of
> > a lower level traversal when the table is unlinked. Table
> > walkers that prune the paging structures must now 'lock' all descendent
> > PTEs, effectively asserting exclusive ownership of the substructure
> > (no other walker can install something to an already locked pte).
> >
> > Additionally, for parallel walks we need to punt the freeing of table
> > pages to the next RCU sync, as there could be multiple observers of the
> > table until all walkers exit the RCU critical section. For this I
> > decided to cram an rcu_head into page private data for every table page.
> > We wind up spending a bit more on table pages now, but lazily allocating
> > for rcu callbacks probably doesn't make a lot of sense. Not only would
> > we need a large cache of them (think about installing a level 1 block)
> > to wire up callbacks on all descendent tables, but we also then need to
> > spend memory to actually free memory.
>
> FWIW we used a similar approach in early versions of the TDP MMU, but
> instead of page->private used page->lru so that more metadata could be
> stored in page->private.
> Ultimately that ended up being too limiting and we decided to switch
> to just using the associated struct kvm_mmu_page as the list element.
> I don't know if ARM has an equivalent construct though.

ARM currently doesn't have any metadata it needs to tie with the table
pages. We just do very basic page reference counting for every valid
PTE. I was going to link together pages (hence the page header), but we
actually do not have a functional need for it at the moment. In fact,
struct page::rcu_head would probably fit the bill and we can avoid extra
metadata/memory for the time being.

Perhaps best to keep it simple and do the rest when we have a genuine
need for it.

--
Thanks,
Oliver
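
For readers following along, the "locked PTE" idea from the cover letter
can be sketched in userspace C11. This is only an illustration under
stated assumptions, not code from the series: PTE_LOCKED, pte_try_lock()
and pte_make() are hypothetical names, the sentinel encoding is made up,
and atomic_compare_exchange_strong() stands in for whatever atomic the
kernel side would use. TLB invalidation between break and make is
omitted entirely.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sentinel: an invalid encoding (bit 0 clear) that marks a
 * PTE as exclusively owned by one walker. The real encoding may differ.
 */
#define PTE_LOCKED ((uint64_t)1 << 63)

typedef _Atomic uint64_t pte_t;

/*
 * Break: atomically replace the value the walker observed with the
 * locked sentinel. Fails if another walker changed or locked the PTE
 * after it was read, so only one walker can win.
 */
static bool pte_try_lock(pte_t *ptep, uint64_t old)
{
	if (old == PTE_LOCKED)
		return false;
	return atomic_compare_exchange_strong(ptep, &old, PTE_LOCKED);
}

/*
 * Make: publish the new value. Only the walker that won pte_try_lock()
 * may call this; release ordering makes earlier writes (e.g. to a newly
 * populated table) visible before the PTE itself.
 */
static void pte_make(pte_t *ptep, uint64_t new)
{
	atomic_store_explicit(ptep, new, memory_order_release);
}
```

A fault handler would read the PTE, call pte_try_lock() with the value
it read, and on failure simply retry the fault; on success it holds the
entry until pte_make() installs the replacement.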