On 16.02.23 18:32, Sean Christopherson wrote:
> On Tue, Feb 14, 2023, Mathias Krause wrote:
>> On 13.02.23 18:05, Sean Christopherson wrote:
>> However, I still remain confident that this makes sense from a uarch point
>> of view. Touching fewer cache lines should be a win -- even if I'm unable
>> to measure it. By preserving more cachelines during a VMEXIT, guests
>> should be able to resume their work faster (assuming they still need these
>> cachelines).
>
> Yes, but I'm skeptical that compacting struct kvm_vcpu will actually result
> in fewer cache lines being touched, at least not in a persistent,
> maintainable way. While every VM-Exit accesses a lot of state, it's most
> definitely still a small percentage of kvm_vcpu. And for the VM-Exits that
> really benefit from optimized handling, a.k.a. the fastpath exits, that
> percentage is even lower.

Yeah, that's probably true.

> On x86, kvm_vcpu is actually comprised of three "major" structs: kvm_vcpu,
> kvm_vcpu_arch, and vcpu_{vmx,svm}. The majority of fields accessed on every
> VM-Exit are in the vendor part, vcpu_{vmx,svm}, but there are still
> high-touch fields in the common structures, e.g. registers in kvm_vcpu_arch
> and things like requests in kvm_vcpu.
>
> Naively compacting any of the three (well, four) structures might result in
> fewer cache lines being touched, but mostly by coincidence. E.g. because
> compacting part of kvm_vcpu_arch happened to better align vcpu_vmx, not
> because of the compaction itself.

Fortunately, kvm_vcpu is embedded as the first member in vcpu_{vmx,svm}, so
all three share a common "header" (see the simplified sketch further below).
Optimizations done for kvm_vcpu will therefore benefit the vendor specific
structures too. However, you're right that this will implicitly change the
layout of the remainder of vcpu_{vmx,svm} and might even have a negative
impact on cacheline usage. But, as my changes chop off exactly 128 bytes from
kvm_vcpu, that's not the case here. I can see, though, that this is
"coincidence" and fragile in the long run.

> If we really wanted to optimize cache line usage, our best bet would be to
> identify the fields that are accessed in the fastpath run loop and then pack
> those fields into as few cache lines as possible. And to do that in a
> maintainable way, we'd probably need something like ASI[*] to allow us to
> detect changes that result in the fastpath handling accessing fields that
> aren't in the cache-optimized region.
>
> I'm not necessarily opposed to such aggressive optimization, but the ROI is
> likely very, very low. For optimized workloads, there simply aren't very
> many VM-Exits, e.g. the majority of exits on a modern CPU are due to timer
> ticks. And even those will hopefully be eliminated in the not-too-distant
> future, e.g. by having hardware virtualize the TSC deadline timer, and by
> moving to a vCPU scheduling scheme that allows for a tickless host.

Well, for guests running grsecurity kernels, there's also the CR0.WP toggling
that triggers VM-Exits, which happens a lot! -- at least until something along
the lines of [1] gets merged *hint ;)*

[1] https://lore.kernel.org/all/20230201194604.11135-1-minipli@xxxxxxxxxxxxxx/

>
> https://lore.kernel.org/all/20220223052223.1202152-1-junaids@xxxxxxxxxx

Heh, that RFC is from February last year and it looks like it stalled at that
point. But I guess you only meant patch 44 anyway, the one that splits up
kvm_vcpu_arch:
https://lore.kernel.org/all/20220223052223.1202152-45-junaids@xxxxxxxxxx/
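
Coming back to the sketch I promised above: the "common header" is just the
usual first-member embedding. Roughly, with trimmed placeholder fields rather
than the real definitions:

/*
 * Simplified sketch of the containment relationship -- placeholder fields,
 * not the real kernel definitions.
 */
#include <stddef.h>

/* open-coded here so the snippet stands alone; the kernel has its own */
#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

struct kvm_vcpu {
        int vcpu_id;
        unsigned long requests;
        /* ... arch-common state (struct kvm_vcpu_arch) ... */
};

struct vcpu_vmx {
        struct kvm_vcpu vcpu;           /* common "header", must stay first */
        unsigned int exit_reason;       /* stand-in for union vmx_exit_reason */
        /* ... vendor specific tail ... */
};

/* roughly what to_vmx() does: recover the vendor view from the common one */
static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
{
        return container_of(vcpu, struct vcpu_vmx, vcpu);
}

Shrinking the common header therefore shifts every vendor specific field by
the same amount -- which is exactly why any cacheline effect on vcpu_{vmx,svm}
is a side effect of the offsets changing, not of the compaction itself.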
That series does the split for other purposes, though, which might conflict
with the performance aspect I'm mostly after here.

Anyways, I got your point. If we care about cacheline footprint, we should do
a more radical change and group hot members together instead of simply
shrinking the structs involved.

>>> And as you observed, imperfect struct layouts are highly unlikely to have
>>> a measurable impact on performance. The types of operations that are
>>> involved in a world switch are just too costly for the layout to matter
>>> much. I do like to shave cycles in the VM-Enter/VM-Exit paths, but only
>>> when a change is inarguably more performant, doesn't require ongoing
>>> maintenance, and/or also improves the code quality.
>>
>> Any pointers to measure the "more performant" aspect?
>
> TL;DR: not really.
>
> What I've done in the past is run a very tight loop in the guest, and then
> measure latency from the host by hacking KVM. Measuring from the guest
> works, e.g. we have a variety of selftests that do exactly that, but when
> trying to shave cycles in the VM-Exit path, it doesn't take many host IRQs
> arriving at the "wrong" time to skew the measurement. My quick-and-dirty
> solution has always been to just hack KVM to measure latency with IRQs
> disabled, but a more maintainable approach would be to add smarts somewhere
> to sanitize the results, e.g. to throw away outliers where the guest likely
> got interrupted.
>
> I believe we've talked about adding a selftest to measure fastpath latency,
> e.g. by writing MSR_IA32_TSC_DEADLINE in a tight loop.
>
> However, that's not going to be useful in this case since you are interested
> in measuring the impact of reducing the host's L1D footprint. If the guest
> isn't cache-constrained, reducing the host's cache footprint isn't going to
> impact performance since there's no contention.

Yeah, it's hard to find a test case that measures the gains. I initially
looked into running Linux userland workloads, but saw no real impact, as the
stddev was already too high. But, as you pointed out, a micro-benchmark is of
no use either, so it's all hand-waving only. :D

> Running a micro benchmark in the guest that aims to measure cache
> performance might work, but presumably those are all userspace tests, i.e.
> you'd end up measuring the impact of the guest kernel too. And they wouldn't
> consistently trigger VM-Exits, so it would be difficult to prove the
> validity of the results.

Yep. It's all just gut feeling, unfortunately.

> I suppose you could write a dedicated selftest or port a micro benchmark to
> run as a selftest (or KVM-unit-test)?
>
>> I tried to make use of the vmx_vmcs_shadow_test in kvm-unit-tests, as it's
>> already counting cycles, but the numbers are too unstable, even if I pin
>> the test to a given CPU, disable turbo mode, SMT, use the performance cpu
>> governor, etc.
>
> Heh, you might have picked quite possibly the worst way to measure VM-Exit
> performance :-)
>
> The guest code in that test that's measuring latency runs at L2. For VMREADs
> and VMWRITEs that are "passed-through" all the way to L2, no VM-Exit will
> occur (the access will be handled purely in ucode). And for accesses that do
> cause a VM-Exit, I'm pretty sure they all result in a nested VM-Exit, which
> is a _very_ heavy path (~10k cycles). Even if the exit is handled by KVM
> (in L0), it's still a relatively slow, heavy path.

I see. I'll have a look at the selftests and see if I can repurpose one of
them. But, as you noted, a micro-benchmark might not be what I'm after.
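
If I do end up writing one, I'd probably start with a guest-side loop like the
rough sketch below -- hypothetical code, not an existing selftest; the helpers
are open-coded and all the plumbing to actually run it as guest code in a
selftest is omitted:

#include <stdint.h>

#define MSR_IA32_TSC_DEADLINE   0x6e0
#define NR_SAMPLES              100000

static inline uint64_t guest_rdtsc(void)
{
        uint32_t lo, hi;

        /* lfence keeps rdtsc from being executed ahead of earlier insns */
        asm volatile("lfence; rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
}

static inline void guest_wrmsr(uint32_t msr, uint64_t val)
{
        asm volatile("wrmsr" :: "c"(msr), "a"((uint32_t)val),
                                "d"((uint32_t)(val >> 32)));
}

/*
 * Runs at CPL0 in the guest; assumes the APIC timer is in TSC-deadline mode
 * and that the WRMSR is intercepted (and, ideally, handled in the fastpath).
 * Returns the best-case WRMSR round-trip in TSC cycles; taking the minimum
 * over many iterations naturally drops samples where a host IRQ or other
 * noise got in the way.
 */
static uint64_t measure_tsc_deadline_roundtrip(void)
{
        uint64_t best = UINT64_MAX;
        int i;

        for (i = 0; i < NR_SAMPLES; i++) {
                uint64_t start = guest_rdtsc();
                uint64_t delta;

                /* deadline far in the future so the timer never fires */
                guest_wrmsr(MSR_IA32_TSC_DEADLINE, start + (1ULL << 40));

                delta = guest_rdtsc() - start;
                if (delta < best)
                        best = delta;
        }

        return best;
}

The absolute numbers probably wouldn't mean much, but the deltas before and
after a layout change should at least be comparable.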
Either way, it's more about identifying the usage patterns of the hot VM-Exit
paths and optimizing those.

>>> I am in favor of cleaning up kvm_mmu_memory_cache as there's no reason to
>>> carry a sub-optimal layout and the change is arguably warranted even
>>> without the change in size. Ditto for kvm_pmu, logically I think it makes
>>> sense to have the version at the very top.
>>
>> Yeah, was exactly thinking the same when modifying kvm_pmu.
>>
>>> But I dislike using bitfields instead of bools in kvm_queued_exception,
>>> and shuffling fields in kvm_vcpu, kvm_vcpu_arch, vcpu_vmx, vcpu_svm, etc.
>>> unless there's a truly egregious field(s) just isn't worth the cost in the
>>> long term.
>>
>> Heh, just found this gem in vcpu_vmx:
>>
>>   struct vcpu_vmx {
>>   [...]
>>           union vmx_exit_reason exit_reason;
>>
>>           /* XXX 44 bytes hole, try to pack */
>>
>>           /* --- cacheline 123 boundary (7872 bytes) --- */
>>           struct pi_desc pi_desc __attribute__((__aligned__(64)));
>>   [...]
>>
>> So there are, in fact, some bigger holes left.
>
> Ya. Again, I'm definitely ok cleaning up the truly heinous warts and/or
> doing a targeted, deliberate refactor of structures. What I don't want to do
> is shuffle fields around purely to save a few bytes here and there.

Got it. I'll back out the reshuffling patches and only keep the ones for
kvm_pmu and kvm_mmu_memory_cache, as these are more like straight cleanups.

Thanks,
Mathias
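
P.S.: Just to illustrate where such a hole comes from: it's pi_desc's 64 byte
alignment requirement, not exit_reason itself. A made-up minimal example
(placeholder members and sizes, x86-64, not the real vcpu_vmx layout):

/* made-up example -- placeholder members, not the real vcpu_vmx layout */
struct before {
        unsigned int exit_reason;       /*  4 bytes at offset 0            */
                                        /* 60 byte hole: the next member   */
                                        /* must start on a 64 byte boundary */
        unsigned char pi_desc[64] __attribute__((__aligned__(64)));
};                                      /* sizeof == 128                   */

struct after {
        unsigned int exit_reason;       /*  4 bytes at offset 0            */
        unsigned int some_flags;        /*  4 bytes at offset 4            */
        unsigned long some_bitmap[7];   /* 56 bytes at offset 8            */
        unsigned char pi_desc[64] __attribute__((__aligned__(64)));
};                                      /* pi_desc at offset 64, no hole,  */
                                        /* sizeof still 128                */

The sizeof doesn't change in this toy example, the hole just gets filled with
useful data instead of padding, so those bytes no longer need cachelines of
their own elsewhere. pahole makes such holes easy to spot, e.g. something like
"pahole -C vcpu_vmx arch/x86/kvm/kvm-intel.ko" (given debug info) prints
layouts annotated like the snippet quoted above.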