On Thu, Jan 26, 2023 at 10:45 AM Ricardo Koller <ricarkol@xxxxxxxxxx> wrote:
>
> On Tue, Jan 24, 2023 at 2:45 PM Oliver Upton <oliver.upton@xxxxxxxxx> wrote:
> >
> > Hi Ricardo,
> >
> > On Fri, Jan 13, 2023 at 03:49:57AM +0000, Ricardo Koller wrote:
> > > Split huge pages eagerly when enabling dirty logging. The goal is to
> > > avoid doing it while faulting on write-protected pages, which
> > > negatively impacts guest performance.
> > >
> > > A memslot marked for dirty logging is split in 1GB pieces at a time.
> > > This is in order to release the mmu_lock and give other kernel threads
> > > the opportunity to run, and also in order to allocate enough pages to
> > > split a 1GB range worth of huge pages (or a single 1GB huge page).
> > > Note that these page allocations can fail, so eager page splitting is
> > > best-effort. This is not a correctness issue though, as huge pages
> > > can still be split on write-faults.
> > >
> > > The benefits of eager page splitting are the same as in x86, added
> > > with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by
> > > the TDP MMU when dirty logging is enabled"). For example, when running
> > > dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
> > > 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
> > > all of their memory after dirty logging is enabled decreased by 44%
> > > from 2.58s to 1.42s.
> > >
> > > Signed-off-by: Ricardo Koller <ricarkol@xxxxxxxxxx>
> > > ---
> > >  arch/arm64/include/asm/kvm_host.h |  30 ++++++++
> > >  arch/arm64/kvm/mmu.c              | 110 +++++++++++++++++++++++++++++-
> > >  2 files changed, 138 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > index 35a159d131b5..6ab37209b1d1 100644
> > > --- a/arch/arm64/include/asm/kvm_host.h
> > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > @@ -153,6 +153,36 @@ struct kvm_s2_mmu {
> > >  	/* The last vcpu id that ran on each physical CPU */
> > >  	int __percpu *last_vcpu_ran;
> > >
> > > +	/*
> > > +	 * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge
> > > +	 * pages. It is used to allocate stage2 page tables while splitting
> > > +	 * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY.
> > > +	 * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both
> > > +	 * the capacity of the split page cache (CACHE_CAPACITY), and how often
> > > +	 * KVM reschedules. Be wary of raising CHUNK_SIZE too high.
> > > +	 *
> > > +	 * A good heuristic to pick CHUNK_SIZE is that it should be larger than
> > > +	 * all the available huge-page sizes, and be a multiple of all the
> > > +	 * other ones; for example, 1GB when all the available huge-page sizes
> > > +	 * are (1GB, 2MB, 32MB, 512MB).
> > > +	 *
> > > +	 * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for
> > > +	 * example, 1GB requires the following number of PAGE_SIZE-pages:
> > > +	 * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB).
> > > +	 * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)).
> > > +	 * - 32 when using 32MB hugepages with 16KB granule (1GB / 32MB).
> > > +	 * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB).
> > > +	 * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB
> > > +	 * granules.
> > > +	 *
> > > +	 * Protected by kvm->slots_lock.
> > > +	 */
> > > +#define EAGER_PAGE_SPLIT_CHUNK_SIZE		SZ_1G
> > > +#define EAGER_PAGE_SPLIT_CACHE_CAPACITY				\
> > > +	(DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) +	\
> > > +	 DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M))
> >
> > Could you instead make use of the existing KVM_PGTABLE_MIN_BLOCK_LEVEL
> > as the batch size? 513 pages across all page sizes is a non-negligible
> > amount of memory that goes largely unused when PAGE_SIZE != 4K.
> >
>
> Sounds good, will refine this for v2.
>
> > With that change it is a lot easier to correctly match the cache
> > capacity to the selected page size. Additionally, we continue to have a
> > single set of batching logic that we can improve later on.
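
For reference, sizing the batch off KVM_PGTABLE_MIN_BLOCK_LEVEL could look
something like the sketch below. This is only an illustration, not the actual
v2: the capacity helper is hypothetical (nothing like it exists in this
patch), and it assumes blocks get split all the way down to PAGE_SIZE PTEs.

/*
 * Hypothetical sketch, not from the patch: derive the chunk size from
 * the block size that stage2_range_addr_end() already batches on, so
 * the cache capacity follows the configured granule automatically.
 */
#define EAGER_PAGE_SPLIT_CHUNK_SIZE \
	kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL)

/* Worst-case number of page-table pages needed to split one chunk. */
static int eager_page_split_cache_capacity(void)
{
	u64 chunk = EAGER_PAGE_SPLIT_CHUNK_SIZE;
	int n = 0;

	/* One table per PUD-sized block, if such blocks can be mapped. */
	if (KVM_PGTABLE_MIN_BLOCK_LEVEL < 2)
		n += DIV_ROUND_UP_ULL(chunk, PUD_SIZE);
	/* One table per PMD-sized block. */
	n += DIV_ROUND_UP_ULL(chunk, PMD_SIZE);

	return n;
}

With 4KB granules this still works out to 513 pages for a 1GB chunk, but
with 16KB or 64KB granules the chunk shrinks to one block and the cache
drops to a single page, instead of the fixed 513 reserved above.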
> >
> > > +	struct kvm_mmu_memory_cache split_page_cache;
> > > +
> > >  	struct kvm_arch *arch;
> > >  };
> > >
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index 700c5774b50d..41ee330edae3 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector;
> > >
> > >  static unsigned long io_map_base;
> > >
> > > -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> > > +bool __read_mostly eager_page_split = true;
> > > +module_param(eager_page_split, bool, 0644);
> > > +
> >
> > Unless someone is really begging for it I'd prefer we not add a module
> > parameter for this.
>
> It was mainly to match x86 and because it makes perf testing a bit
> simpler. What do others think?
>
> >
> > > +static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
> > > +					   phys_addr_t size)
> > >  {
> > > -	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> > >  	phys_addr_t boundary = ALIGN_DOWN(addr + size, size);
> > >
> > >  	return (boundary - 1 < end - 1) ? boundary : end;
> > >  }
> > >
> > > +static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> > > +{
> > > +	phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> > > +
> > > +	return __stage2_range_addr_end(addr, end, size);
> > > +}
> > > +
> > >  /*
> > >   * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
> > >   * we may see kernel panics with CONFIG_DETECT_HUNG_TASK,
> > > @@ -71,6 +81,64 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
> > >  	return ret;
> > >  }
> > >
> > > +static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
> > > +{
> > > +	return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
> > > +}
> >
> > I don't think the helper is adding too much here.
>
> Will try how it looks without it.
>
> >
> > > +static bool need_topup_split_page_cache_or_resched(struct kvm *kvm)
> > > +{
> > > +	struct kvm_mmu_memory_cache *cache;
> > > +
> > > +	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> > > +		return true;
> > > +
> > > +	cache = &kvm->arch.mmu.split_page_cache;
> > > +	return need_topup(cache, EAGER_PAGE_SPLIT_CACHE_CAPACITY);
> > > +}
> > > +
> > > +static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
> > > +				    phys_addr_t end)
> > > +{
> > > +	struct kvm_mmu_memory_cache *cache;
> > > +	struct kvm_pgtable *pgt;
> > > +	int ret;
> > > +	u64 next;
> > > +	int cache_capacity = EAGER_PAGE_SPLIT_CACHE_CAPACITY;
> > > +
> > > +	lockdep_assert_held_write(&kvm->mmu_lock);
> >
> > Rather than having the caller acquire the lock, can you instead do it
> > here? It would appear that the entire critical section is enclosed
> > within this function.
>
> Sure. I will first double check things related to perf and correctness,
> just in case. I'm not sure whether the extra lock acquire/releases make
> any difference to performance, and I'm not sure whether there's a
> correctness issue from releasing the lock between write-protection and
> splitting (I think it should be fine, but I'm not 100% sure).
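
If the locking does move inside, one possible shape is sketched below. This
is only an illustration assembled from the hunks quoted in this mail, with
the lockdep assertions left out (the slots_lock one is discussed further
below). The one behavioral change is that the !pgt error path has to break
out of the loop so the lock is dropped before returning:

static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
				    phys_addr_t end)
{
	struct kvm_mmu_memory_cache *cache;
	struct kvm_pgtable *pgt;
	int ret = 0;
	u64 next;
	int cache_capacity = EAGER_PAGE_SPLIT_CACHE_CAPACITY;

	cache = &kvm->arch.mmu.split_page_cache;

	write_lock(&kvm->mmu_lock);
	do {
		if (need_topup_split_page_cache_or_resched(kvm)) {
			/* Drop the lock to reschedule or allocate. */
			write_unlock(&kvm->mmu_lock);
			cond_resched();
			/* Eager page splitting is best-effort. */
			ret = __kvm_mmu_topup_memory_cache(cache,
							   cache_capacity,
							   cache_capacity);
			write_lock(&kvm->mmu_lock);
			if (ret)
				break;
		}

		pgt = kvm->arch.mmu.pgt;
		if (!pgt) {
			/* Must not return with mmu_lock held. */
			ret = -EINVAL;
			break;
		}

		next = __stage2_range_addr_end(addr, end,
					       EAGER_PAGE_SPLIT_CHUNK_SIZE);
		ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache);
		if (ret)
			break;
	} while (addr = next, addr != end);
	write_unlock(&kvm->mmu_lock);

	return ret;
}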
>
> >
> > > +	lockdep_assert_held(&kvm->slots_lock);
> >
> > This function doesn't depend on anything guarded by the slots_lock, can
> > you move this to kvm_mmu_split_memory_region()?
>
> kvm_mmu_split_memory_region() takes a memslot. That works in this case,
> eager splitting when enabling dirty logging, but it won't work in the
> next commit, which splits on the CLEAR ioctl.
> Ahh, you meant just the "lockdep" line. Yes, that makes sense. Will do.
>
> >
> > > +	cache = &kvm->arch.mmu.split_page_cache;
> > > +
> > > +	do {
> > > +		if (need_topup_split_page_cache_or_resched(kvm)) {
> > > +			write_unlock(&kvm->mmu_lock);
> > > +			cond_resched();
> > > +			/* Eager page splitting is best-effort. */
> > > +			ret = __kvm_mmu_topup_memory_cache(cache,
> > > +							   cache_capacity,
> > > +							   cache_capacity);
> > > +			write_lock(&kvm->mmu_lock);
> > > +			if (ret)
> > > +				break;
> > > +		}
> > > +
> > > +		pgt = kvm->arch.mmu.pgt;
> > > +		if (!pgt)
> > > +			return -EINVAL;
> > > +
> > > +		next = __stage2_range_addr_end(addr, end,
> > > +					       EAGER_PAGE_SPLIT_CHUNK_SIZE);
> > > +		ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache);
> > > +		if (ret)
> > > +			break;
> > > +	} while (addr = next, addr != end);
> > > +
> > > +	return ret;
> > > +}
> >
> > --
> > Thanks,
> > Oliver
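
On the slots_lock point, a sketch of where the assertion could land.
kvm_mmu_split_memory_region() is only named in this thread, so the body
below is an assumption about its shape rather than the actual code:

static void kvm_mmu_split_memory_region(struct kvm *kvm, int slot)
{
	struct kvm_memslots *slots = kvm_memslots(kvm);
	struct kvm_memory_slot *memslot = id_to_memslot(slots, slot);
	phys_addr_t start, end;

	/* The memslot lookup is what actually needs the slots_lock. */
	lockdep_assert_held(&kvm->slots_lock);

	start = memslot->base_gfn << PAGE_SHIFT;
	end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;

	/* Best-effort: on failure, huge pages still split on write-faults. */
	kvm_mmu_split_huge_pages(kvm, start, end);
}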