On 2020-10-23 02:26, Shaokun Zhang wrote:
Hi Marc,
On 2020/10/22 20:03, Marc Zyngier wrote:
On 2020-10-22 02:57, Shaokun Zhang wrote:
From: Nianyao Tang <tangnianyao@xxxxxxxxxx>
Currently HCR_EL2.FB is set, forcing TLB invalidations to be broadcast
to all online pcpus, even those the VM's vcpus do not reside on. This
gets worse as the system grows and more pcpus come online.
Let's disable force-broadcast, and instead flush the local TLB when a
vcpu moves to a new pcpu, in case stale page mappings still exist there.
Cc: Marc Zyngier <maz@xxxxxxxxxx>
Signed-off-by: Nianyao Tang <tangnianyao@xxxxxxxxxx>
Signed-off-by: Shaokun Zhang <zhangshaokun@xxxxxxxxxxxxx>
---
arch/arm64/include/asm/kvm_arm.h | 2 +-
arch/arm64/kvm/arm.c | 4 +++-
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 64ce29378467..f85ea9c649cb 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -75,7 +75,7 @@
* PTW: Take a stage2 fault if a stage1 walk steps in device memory
*/
#define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \
- HCR_BSU_IS | HCR_FB | HCR_TAC | \
+ HCR_BSU_IS | HCR_TAC | \
HCR_AMO | HCR_SWIO | HCR_TIDCP | HCR_RW | HCR_TLOR | \
HCR_FMO | HCR_IMO | HCR_PTW )
#define HCR_VIRT_EXCP_MASK (HCR_VSE | HCR_VI | HCR_VF)
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index acf9a993dfb6..845be911f885 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -334,8 +334,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
/*
* We might get preempted before the vCPU actually runs, but
* over-invalidation doesn't affect correctness.
+ * Stale TLB entries might still exist on this pCPU if the
+ * vCPU modified page mappings while running on another pCPU.
*/
- if (*last_ran != vcpu->vcpu_id) {
+ if (*last_ran != vcpu->vcpu_id || vcpu->cpu != cpu) {
kvm_call_hyp(__kvm_tlb_flush_local_vmid, mmu);
*last_ran = vcpu->vcpu_id;
}
This breaks uniprocessor semantics for I-cache invalidation. What could
possibly go wrong? You also fail to provide any data that would back up
your claim, as usual.
Oops, we shall consider this.
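To make the breakage concrete, here is a minimal guest-side sketch
(illustrative names, not code from this patch): a guest that assumes
uniprocessor semantics may patch code using the local IC IALLU and
rely on FB=1 to have the hypervisor broadcast it.

/*
 * Guest-side code patching under UP assumptions (sketch).
 */
static void patch_and_run(u32 *insn, u32 new_insn)
{
	WRITE_ONCE(*insn, new_insn);              /* store the new instruction */
	asm volatile("dc cvau, %0" :: "r"(insn)); /* clean D-cache to PoU */
	asm volatile("dsb nsh");
	asm volatile("ic iallu");                 /* LOCAL I-cache invalidate */
	asm volatile("dsb nsh; isb");
	/*
	 * With HCR_EL2.FB=1, "ic iallu" is upgraded to its broadcast form
	 * (IC IALLUIS), so this stays correct even if KVM migrates the
	 * vCPU right here. With FB=0 and no I-cache maintenance on
	 * migration, the new pCPU may fetch the stale instruction below.
	 */
	((void (*)(void))insn)();
}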
Test: UnixBench spawn in each 4U8G VM on Kunpeng 920:
(1) 24 4U8G VMs, each vcpu taskset to one pcpu to keep
the VMs separated.
That's a very narrow use case.
(2) 1 4U8G VM
Result:
(1) 800 * 24
(2) 3200
I don't know what these numbers are.
By restricting DVM broadcast across NUMA nodes, result (1) can
be improved to 2300 * 24. In the spawn testcase, TLBI IS is
used when creating processes.
Linux always uses TLBI IS, except in some very rare cases.
If your HW broadcasts invalidations across the whole system
without any filtering, it is bound to be bad.
Further, we are considering restricting the TLBI broadcast range
to make TLBI more accurate.
One scheme is replacing TLBI IS with IPI + local TLBI + HCR_EL2.FB=0,
as in the sketch below.
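A rough sketch of that scheme (assuming a hypothetical
vm_running_cpus mask that KVM would have to maintain; this is not
the posted patch):

#include <linux/cpumask.h>
#include <linux/smp.h>

/* Run on each targeted pCPU via IPI: a local, non-broadcast flush. */
static void local_flush_vmid(void *info)
{
	struct kvm_s2_mmu *mmu = info;

	kvm_call_hyp(__kvm_tlb_flush_local_vmid, mmu);
}

/*
 * Instead of a TLBI IS reaching every pCPU in the Inner Shareable
 * domain, IPI only the pCPUs this VM's vCPUs currently run on.
 */
static void flush_vmid_by_ipi(struct kvm_s2_mmu *mmu,
			      const struct cpumask *vm_running_cpus)
{
	on_each_cpu_mask(vm_running_cpus, local_flush_vmid, mmu, 1);
}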
Ah. That old thing again. No, thank you. The right thing to do
is to build CPUs and interconnects that properly deal with
IS invalidations by not sending DVM messages to places where
things don't run (snoop filters?). I also doubt the arm64
maintainers would be sympathetic to the idea anyway.
Considering I-cache invalidation, KVM would also need to invalidate
the I-cache when moving a vCPU to a new pCPU (see the sketch below).
Are we missing any other cases that need HCR_EL2.FB == 1?
Eventually we expect HCR_EL2.FB == 0 if at all possible.
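If FB were ever cleared, the kvm_arch_vcpu_load() hunk above would
presumably need a local I-cache invalidation too; a sketch (the
inline asm is our assumption, not an existing KVM helper):

	if (*last_ran != vcpu->vcpu_id || vcpu->cpu != cpu) {
		kvm_call_hyp(__kvm_tlb_flush_local_vmid, mmu);
		/*
		 * With FB=0, the guest's local IC IALLU no longer reaches
		 * the pCPU we just migrated to, so invalidate it here.
		 */
		asm volatile("ic iallu; dsb nsh; isb" ::: "memory");
		*last_ran = vcpu->vcpu_id;
	}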
I have no intention to ever set FB to 0, as this would result
in over-invalidation in the general case, and worse performance
on well designed systems.
M.
--
Jazz is not dead. It just smells funny...