On 21.05.2021 09:05, Maciej S. Szmigiero wrote:
On 20.05.2021 00:31, Sean Christopherson wrote:
On Sun, May 16, 2021, Maciej S. Szmigiero wrote:
(..)
new_size = old_size;
slots = kvzalloc(new_size, GFP_KERNEL_ACCOUNT);
- if (likely(slots))
- memcpy(slots, old, old_size);
+ if (unlikely(!slots))
+ return NULL;
+
+ memcpy(slots, old, old_size);
+
+ hash_init(slots->id_hash);
+ kvm_for_each_memslot(memslot, slots)
+ hash_add(slots->id_hash, &memslot->id_node, memslot->id);
What's the perf penalty if the number of memslots gets large? I ask because the
lazy rmap allocation is adding multiple calls to kvm_dup_memslots().
I would expect the "move inactive" benchmark to be closest to measuring
the performance of just a memslot array copy operation but the results
suggest that the performance stays within ~10% window from 10 to 509
memslots on the old code (it then climbs 13x for 32k case).
That suggests that something else is dominating this benchmark for these
memslot counts (probably zapping of shadow pages).
At the same time, the tree-based memslots implementation is clearly
faster in this benchmark, even for smaller memslot counts, so apparently
copying of the memslot array has some performance impact, too.
Measuring just kvm_dup_memslots() performance would probably be done
best by benchmarking KVM_MR_FLAGS_ONLY operation - will try to add this
operation to my set of benchmarks and see how it performs with different
memslot counts.
Update:
I've implemented a simple KVM_MR_FLAGS_ONLY benchmark, that repeatably
sets and unsets KVM_MEM_LOG_DIRTY_PAGES flag on a memslot with a single
page of memory in it. [1]
Since on the current code with higher memslot counts the "set flags"
operation spends a significant time in kvm_mmu_calculate_default_mmu_pages()
a second set of measurements was done with patch [2] applied.
In this case, the top functions in the perf trace are "memcpy" and
"clear_page" (called from kvm_set_memslot(), most likely from inlined
kvm_dup_memslots()).
For reference, a set of measurements with the whole patch series
(patches 1 - 8) applied was also done, as "new code".
In this case, SRCU-related functions dominate the perf trace.
32k memslots:
Current code: 0.00130s
Current code + patch [2]: 0.00104s (13x 4k result)
New code: 0.0000144s
4k memslots:
Current code: 0.0000899s
Current code + patch [2]: 0.0000799s (+78% 2k result)
New code: 0.0000144s
2k memslots:
Current code: 0.0000495s
Current code + patch [2]: 0.0000447s (+54% 509 result)
New code: 0.0000143s
509 memslots:
Current code: 0.0000305s
Current code + patch [2]: 0.0000290s (+5% 100 result)
New code: 0.0000141s
100 memslots:
Current code: 0.0000280s
Current code + patch [2]: 0.0000275s (same as for 10 slots)
New code: 0.0000142s
10 memslots:
Current code: 0.0000272s
Current code + patch [2]: 0.0000272s
New code: 0.0000141s
Thanks,
Maciej
[1]: The patch against memslot_perf_test.c is available here:
https://github.com/maciejsszmigiero/linux/commit/841e94898a55ff79af9d20a08205aa80808bd2a8
[2]: "[PATCH v3 1/8] KVM: x86: Cache total page count to avoid traversing the memslot array"