Right after a vPE is made resident, the code starts polling the
GICR_VPENDBASER.Dirty bit until it becomes 0, with delay_us set to 10.
In our measurements, however, parsing the VPT takes only hundreds of
nanoseconds, or 1-2 microseconds, in most cases. What's more, we found
that the MMIO delay on a GICv4.1 system (HiSilicon) is roughly 8 times
higher than that on a GICv4.0 system in kvm-unit-tests (data below).

                    | GICv4.1 emulator | GICv4.0 emulator
mmio_read_user (ns) |            12811 |             1598

Our analysis shows that this is mainly caused by the delay_us of 10, so
it might really hurt performance.

To avoid this, we can set delay_us to 1, which is more appropriate in
this situation and more generally applicable. Besides, we can delay the
execution of the polling, giving the GIC a chance to work in parallel
with the CPU on the entry path.

Shenming Lu (2):
  irqchip/gic-v4.1: Reduce the delay time of the poll on the
    GICR_VPENDBASER.Dirty bit
  KVM: arm64: Delay the execution of the polling on the
    GICR_VPENDBASER.Dirty bit

 arch/arm64/kvm/vgic/vgic-v4.c      | 16 ++++++++++++++++
 arch/arm64/kvm/vgic/vgic.c         |  3 +++
 drivers/irqchip/irq-gic-v3-its.c   | 18 +++++++++++++-----
 drivers/irqchip/irq-gic-v4.c       | 11 +++++++++++
 include/kvm/arm_vgic.h             |  3 +++
 include/linux/irqchip/arm-gic-v4.h |  4 ++++
 6 files changed, 50 insertions(+), 5 deletions(-)

-- 
2.23.0