On 13.02.23 18:05, Sean Christopherson wrote:
> On Mon, Feb 13, 2023, Mathias Krause wrote:
>> Relayout members of struct kvm_vcpu and embedded structs to reduce its
>> memory footprint. Not that it makes sense from a memory usage point of
>> view (given how few of such objects get allocated), but this series
>> manages to make it consume two cachelines less, which should provide a
>> micro-architectural net win. However, I wasn't able to see a noticeable
>> difference running benchmarks within a guest VM -- the VMEXIT costs are
>> likely still high enough to mask any gains.
>
> ...
>
>> Below is the high level pahole(1) diff. Most significant is the overall
>> size change from 6688 to 6560 bytes, i.e. -128 bytes.
>
> While part of me wishes KVM were more careful about struct layouts, IMO
> fiddling with per vCPU or per VM structures isn't worth the ongoing
> maintenance cost.
>
> Unless the size of the vCPU allocation (vcpu_vmx or vcpu_svm in x86 land)
> crosses a meaningful boundary, e.g. drops the size from an order-3 to
> order-2 allocation, the memory savings are negligible in the grand scheme.
> Assuming the kernel is even capable of perfectly packing vCPU allocations,
> saving even a few hundred bytes per vCPU is uninteresting unless the vCPU
> count gets reaaally high, and at that point the host likely has hundreds
> of GiB of memory, i.e. saving a few KiB is again uninteresting.

Fully agree! That's why I said this change makes no sense from a memory
usage point of view. The overall memory savings aren't visible at all, as
the slab allocator can't fit more vCPU objects into a given slab page
anyway. However, I still remain confident that this makes sense from a
uarch point of view. Touching fewer cache lines should be a win -- even if
I'm unable to measure it. By preserving more cachelines during a VMEXIT,
guests should be able to resume their work faster (assuming they still
need those cachelines).

> And as you observed, imperfect struct layouts are highly unlikely to have
> a measurable impact on performance. The types of operations that are
> involved in a world switch are just too costly for the layout to matter
> much. I do like to shave cycles in the VM-Enter/VM-Exit paths, but only
> when a change is inarguably more performant, doesn't require ongoing
> maintenance, and/or also improves the code quality.

Any pointers on how to measure the "more performant" aspect? I tried to
make use of the vmx_vmcs_shadow_test in kvm-unit-tests, as it's already
counting cycles, but the numbers are too unstable, even if I pin the test
to a given CPU, disable turbo mode and SMT, use the performance cpufreq
governor, etc.

> I am in favor of cleaning up kvm_mmu_memory_cache as there's no reason to
> carry a sub-optimal layout and the change is arguably warranted even
> without the change in size. Ditto for kvm_pmu, logically I think it makes
> sense to have the version at the very top.

Yeah, I was thinking exactly the same when modifying kvm_pmu.

> But I dislike using bitfields instead of bools in kvm_queued_exception,
> and shuffling fields in kvm_vcpu, kvm_vcpu_arch, vcpu_vmx, vcpu_svm, etc.
> unless there's a truly egregious field(s) just isn't worth the cost in
> the long term.

Heh, I just found this gem in vcpu_vmx:

struct vcpu_vmx {
	[...]
	union vmx_exit_reason      exit_reason;

	/* XXX 44 bytes hole, try to pack */

	/* --- cacheline 123 boundary (7872 bytes) --- */
	struct pi_desc             pi_desc __attribute__((__aligned__(64)));
	[...]

So there are, in fact, some bigger holes left.
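Just to make the cost of such a hole concrete, here's a reduced example
(field names and sizes are made up, it's not the real vcpu_vmx layout):
the alignment attribute forces padding in front of the member, and moving
smaller members into that padding shaves off a whole cacheline:

struct packed_badly {
	unsigned int exit_reason;		/* bytes   0..3   */

	/* XXX 60 bytes hole, forced by the alignment below */

	unsigned long pi_desc[8]
		__attribute__((__aligned__(64)));	/* bytes  64..127 */
	unsigned int later_state[15];		/* bytes 128..187 */
};						/* sizeof: 192    */

struct packed_well {
	unsigned int exit_reason;		/* bytes   0..3   */
	unsigned int later_state[15];		/* bytes   4..63, fills the hole */
	unsigned long pi_desc[8]
		__attribute__((__aligned__(64)));	/* bytes  64..127 */
};						/* sizeof: 128, one cacheline less */

The same reasoning applies to the 44 bytes hole above, just with more
members involved.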
Would be nice if pahole had a --density flag that would output some ASCII art, visualizing which bytes of a struct are allocated by real members and which ones are pure padding.
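In the meantime, a quick hack along these lines (purely illustrative, not
actual pahole code, and the layout in main() is made up) is roughly what I
have in mind -- one character per byte, '#' for member data, '.' for
padding, one row per cacheline:

#include <stdio.h>

struct member {
	unsigned int off, size;
};

/*
 * Print one character per byte of the struct: '#' for bytes covered by a
 * member, '.' for padding.  Members must be sorted by offset and must not
 * overlap.  One output row per 64 byte cacheline.
 */
static void density(const struct member *m, unsigned int nr, unsigned int total)
{
	unsigned int byte, i = 0;

	for (byte = 0; byte < total; byte++) {
		while (i < nr && byte >= m[i].off + m[i].size)
			i++;
		putchar(i < nr && byte >= m[i].off ? '#' : '.');
		if ((byte + 1) % 64 == 0)
			putchar('\n');
	}
	if (total % 64)
		putchar('\n');
}

int main(void)
{
	/* made up layout: 4 byte member, 60 byte hole, 64 byte aligned member */
	static const struct member layout[] = {
		{ 0, 4 }, { 64, 64 },
	};

	density(layout, sizeof(layout) / sizeof(layout[0]), 128);
	return 0;
}

For the made up layout it prints:

####............................................................
################################################################

which would make holes like the 44 bytes one in vcpu_vmx stick out
immediately.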