[PATCH v12 04/16] arm64: kvm: allows kvm cpu hotplug

marc.zyngier@xxxxxxx (Marc Zyngier) · Tue, 15 Dec 2015 10:13:34 +0000

On 15/12/15 09:51, AKASHI Takahiro wrote:
> On 12/15/2015 05:45 PM, Marc Zyngier wrote:
>> On 15/12/15 07:51, AKASHI Takahiro wrote:
>>> On 12/15/2015 02:33 AM, Marc Zyngier wrote:
>>>> On 14/12/15 07:33, AKASHI Takahiro wrote:
>>>>> Marc,
>>>>>
>>>>> On 12/12/2015 01:28 AM, Marc Zyngier wrote:
>>>>>> On 11/12/15 08:06, AKASHI Takahiro wrote:
>>>>>>> Ashwin, Marc,
>>>>>>>
>>>>>>> On 12/03/2015 10:58 PM, Marc Zyngier wrote:
>>>>>>>> On 02/12/15 22:40, Ashwin Chaugule wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> On 24 November 2015 at 17:25, Geoff Levand <geoff at infradead.org> wrote:
>>>>>>>>>> From: AKASHI Takahiro <takahiro.akashi at linaro.org>
>>>>>>>>>>
>>>>>>>>>> The current kvm implementation on arm64 does cpu-specific initialization
>>>>>>>>>> at system boot, and has no way to gracefully shut down a core in terms of
>>>>>>>>>> kvm. This prevents, especially, kexec from rebooting the system on a boot
>>>>>>>>>> core in EL2.
>>>>>>>>>>
>>>>>>>>>> This patch adds a cpu tear-down function and also puts an existing cpu-init
>>>>>>>>>> code into a separate function, kvm_arch_hardware_disable() and
>>>>>>>>>> kvm_arch_hardware_enable() respectively.
>>>>>>>>>> We don't need arm64-specific cpu hotplug hook any more.
>>>>>>>>>>
>>>>>>>>>> Since this patch modifies common part of code between arm and arm64, one
>>>>>>>>>> stub definition, __cpu_reset_hyp_mode(), is added on arm side to avoid
>>>>>>>>>> compiling errors.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: AKASHI Takahiro <takahiro.akashi at linaro.org>
>>>>>>>>>> ---
>>>>>>>>>>      arch/arm/include/asm/kvm_host.h   | 10 ++++-
>>>>>>>>>>      arch/arm/include/asm/kvm_mmu.h    |  1 +
>>>>>>>>>>      arch/arm/kvm/arm.c                | 79 ++++++++++++++++++---------------------
>>>>>>>>>>      arch/arm/kvm/mmu.c                |  5 +++
>>>>>>>>>>      arch/arm64/include/asm/kvm_host.h | 16 +++++++-
>>>>>>>>>>      arch/arm64/include/asm/kvm_mmu.h  |  1 +
>>>>>>>>>>      arch/arm64/include/asm/virt.h     |  9 +++++
>>>>>>>>>>      arch/arm64/kvm/hyp-init.S         | 33 ++++++++++++++++
>>>>>>>>>>      arch/arm64/kvm/hyp.S              | 32 ++++++++++++++--
>>>>>>>>>>      9 files changed, 138 insertions(+), 48 deletions(-)
>>>>>>>>>
>>>>>>>>> [..]
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>      static struct notifier_block hyp_init_cpu_pm_nb = {
>>>>>>>>>> @@ -1108,11 +1119,6 @@ static int init_hyp_mode(void)
>>>>>>>>>>             }
>>>>>>>>>>
>>>>>>>>>>             /*
>>>>>>>>>> -        * Execute the init code on each CPU.
>>>>>>>>>> -        */
>>>>>>>>>> -       on_each_cpu(cpu_init_hyp_mode, NULL, 1);
>>>>>>>>>> -
>>>>>>>>>> -       /*
>>>>>>>>>>              * Init HYP view of VGIC
>>>>>>>>>>              */
>>>>>>>>>>             err = kvm_vgic_hyp_init();
>>>>>>>>>
>>>>>>>>> With this flow, the cpu_init_hyp_mode() is called only at VM guest
>>>>>>>>> creation, but vgic_hyp_init() is called at bootup. On a system with
>>>>>>>>> GICv3, it looks like we end up with bogus values from the ICH_VTR_EL2
>>>>>>>>> (to get the number of LRs), because we're not reading it from EL2
>>>>>>>>> anymore.
>>>>>>>
>>>>>>> Thank you for pointing this out.
>>>>>>> Recently I tested my kdump code on hikey, and as hikey(hi6220) has gic-400,
>>>>>>> I didn't notice this problem.
>>>>>>
>>>>>> Because GIC-400 is a GICv2 implementation, which is entirely MMIO based.
>>>>>> GICv3 uses some system registers that are only available at EL2, and KVM
>>>>>> needs some information contained in these registers before being able to
>>>>>> get initialized.
>>>>>
>>>>> I see.
>>>>>
>>>>>>>> Indeed, this is completely broken (I just reproduced the issue on a
>>>>>>>> model). I wish this kind of detail had been checked earlier, but thanks
>>>>>>>> for pointing it out.
>>>>>>>>
>>>>>>>>> What's the best way to fix this?
>>>>>>>>> - Call kvm_arch_hardware_enable() before vgic_hyp_init() and disable later?
>>>>>>>>> - Fold the VGIC init stuff back into hardware_enable()?
>>>>>>>>
>>>>>>>> None of that works - kvm_arch_hardware_enable() is called once per CPU,
>>>>>>>> while vgic_hyp_init() can only be called once. Also,
>>>>>>>> kvm_arch_hardware_enable() is called from interrupt context, and I
>>>>>>>> wouldn't feel comfortable starting probing DT and allocating stuff from
>>>>>>>> there.
>>>>>>>
>>>>>>> Do you think so?
>>>>>>> How about the fixup! patch attached below?
>>>>>>> The point is that, like Ashwin's first idea, we initialize cpus temporarily
>>>>>>> before kvm_vgic_hyp_init() and then soon reset cpus again. Thus,
>>>>>>> kvm cpu hotplug will still continue to work as before.
>>>>>>> Now that cpu_init_hyp_mode() is revived as exactly the same as Marc's
>>>>>>> original code, the change will not be a big jump.
>>>>>>
>>>>>> This seems quite complicated:
>>>>>> - init EL2 on  all CPUs
>>>>>> - do some initialization
>>>>>> - tear all CPUs EL2 down
>>>>>> - let KVM drive the vectors being set or not
>>>>>>
>>>>>> My questions are: why do we need to do this on *all* cpus? Can't that
>>>>>> work on a single one?
>>>>>
>>>>> I did initialize all the cpus partly because using preempt_enable/disable
>>>>> looked a bit ugly and partly because we may, in the future, do additional
>>>>> per-cpu initialization in kvm_vgic_hyp_init() and/or kvm_timer_hyp_init().
>>>>> But if you're comfortable with preempt_*() stuff, I don't care.
>>>>>
>>>>>
>>>>>> Also, the simple fact that we were able to get some junk value is a sign
>>>>>> that something is amiss. I'd expect a splat of some sort, because we now
>>>>>> have a possibility of doing things in the wrong context.
>>>>>>
>>>>>>>
>>>>>>> If kvm_hyp_call() in vgic_v3_probe()/kvm_vgic_hyp_init() is a *problem*,
>>>>>>> I hope this should work. Actually I confirmed that, with this fixup! patch,
>>>>>>> we could run a kvm guest and also successfully executed kexec on model w/gic-v3.
>>>>>>>
>>>>>>> My only concern is the following kernel message I saw when kexec shut down
>>>>>>> the kernel:
>>>>>>> (Please note that I was running one kvm guest (pid=961) here.)
>>>>>>>
>>>>>>> ===
>>>>>>> sh-4.3# ./kexec -d -e
>>>>>>> kexec version: 15.11.16.11.06-g41e52e2
>>>>>>> arch_process_options:112: command_line: (null)
>>>>>>> arch_process_options:114: initrd: (null)
>>>>>>> arch_process_options:115: dtb: (null)
>>>>>>> arch_process_options:117: port: 0x0
>>>>>>> kvm: exiting hardware virtualization
>>>>>>> kvm [961]: Unsupported exception type: 6248304    <== this message
>>>>>>
>>>>>> That makes me feel very uncomfortable. It looks like we've exited a
>>>>>> guest with some horrible value in X0. How is that even possible?
>>>>>>
>>>>>> This deserves to be investigated.
>>>>>
>>>>> I guess the problem is that the cpu tear-down function is called even while a kvm guest
>>>>> is still running in kvm_arch_vcpu_ioctl_run().
>>>>> Adding a check on whether the cpu has been initialized to every iteration of
>>>>> kvm_arch_vcpu_ioctl_run() will, if necessary, terminate a guest safely without entering
>>>>> guest mode. Since this check is done while interrupts are disabled, it won't
>>>>> interfere with kvm_arch_hardware_disable() called via IPI.
>>>>> See the attached fixup patch.
>>>>>
>>>>> Again, I verified the code on model.
>>>>>
>>>>> Thanks,
>>>>> -Takahiro AKASHI
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> 	M.
>>>>>>
>>>>>
>>>>> ----8<----
>>>>>    From 77f273ba5e0c3dfcf75a5a8d1da8035cc390250c Mon Sep 17 00:00:00 2001
>>>>> From: AKASHI Takahiro <takahiro.akashi at linaro.org>
>>>>> Date: Fri, 11 Dec 2015 13:43:35 +0900
>>>>> Subject: [PATCH] fixup! arm64: kvm: allows kvm cpu hotplug
>>>>>
>>>>> ---
>>>>>     arch/arm/kvm/arm.c |   45 ++++++++++++++++++++++++++++++++++-----------
>>>>>     1 file changed, 34 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
>>>>> index 518c3c7..d7e86fb 100644
>>>>> --- a/arch/arm/kvm/arm.c
>>>>> +++ b/arch/arm/kvm/arm.c
>>>>> @@ -573,7 +573,11 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>>>>     		/*
>>>>>     		 * Re-check atomic conditions
>>>>>     		 */
>>>>> -		if (signal_pending(current)) {
>>>>> +		if (__hyp_get_vectors() == hyp_default_vectors) {
>>>>> +			/* cpu has been torn down */
>>>>> +			ret = -ENOEXEC;
>>>>> +			run->exit_reason = KVM_EXIT_SHUTDOWN;
>>>>
>>>>
>>>> That feels completely overkill (and very slow). Why don't you maintain a
>>>> per-cpu variable containing the CPU states, which will avoid calling
>>>> __hyp_get_vectors() all the time? You should be able to reuse that
>>>> construct everywhere.
>>>
>>> OK. Since I have already introduced a per-cpu variable, kvm_arm_hardware_enabled, for the
>>> cpuidle issue, we will be able to re-use it.
>>>
>>>> Also, I'm not sure about KVM_EXIT_SHUTDOWN. This looks very x86 specific
>>>> (called on triple fault).
>>>
>>> No, I don't think so.
>>
>> maz at approximate:~/Work/arm-platforms$ git grep KVM_EXIT_SHUTDOWN
>> arch/x86/kvm/svm.c:     kvm_run->exit_reason = KVM_EXIT_SHUTDOWN;
>> arch/x86/kvm/vmx.c:     vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
>> arch/x86/kvm/x86.c:                     vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
>> include/uapi/linux/kvm.h:#define KVM_EXIT_SHUTDOWN         8
>>
>> And that's it. No other architecture ever generates this, and this is
>> an undocumented API. So I'm not going to let that in until someone actually
>> defines what this thing means.
>>
>>> Looking at kvm_cpu_exec() in kvm-all.c of qemu, KVM_EXIT_SHUTDOWN
>>> is handled in a generic way and results in a reset request.
>>
>> Which is not what we want. We want to indicate that the guest couldn't
>> be entered. This is not due to a guest doing a triple fault (which is
>> the way an x86 system gets rebooted).
>>
>>> On the other hand, KVM_EXIT_FAIL_ENTRY seems more arch-specific.
>>
>> Certainly arch specific, but actually extremely accurate. You couldn't
>> enter the guest, and you describe the reason in an architecture-specific
>> fashion. This is also the only exit code that describes this exact case
>> we're talking about here.
>>
>>> In addition, if kvm_vcpu_ioctl() returns a negative value, run->exit_reason
>>> will never be examined.
>>> So I think
>>>      ret -> 0
>>>      run->exit_reason -> KVM_EXIT_SHUTDOWN
>>
>> ret = 0
>> run->exit_reason = KVM_EXIT_FAIL_ENTRY;
>> run->hardware_entry_failure_reason = (u64)-ENOEXEC;
> 
> OK.
> 
>>> or just
>>>      ret -> -ENOEXEC
>>> is the best.
>>>
>>> Either way, a guest will have no good chance to shut itself down gracefully
>>> because we're kexec'ing (without waiting for threads' termination).
>>
>> Well, at least userspace gets a chance - and should kexec fail, we have
>> a chance to recover.
> 
> Well, the current kexec implementation (on arm64) never fails
> except at a very early stage :)
> 
> So please review the attached fixup patch, again.
> 
> Thanks,
> -Takahiro AKASHI
> 
> 
>> Thanks,
>>
>> 	M.
>>
> 
> ----8<----
>  From ec6c07fe80d6ba96855468f61daffa9b91cf5622 Mon Sep 17 00:00:00 2001
> From: AKASHI Takahiro <takahiro.akashi at linaro.org>
> Date: Fri, 11 Dec 2015 13:43:35 +0900
> Subject: [PATCH] fixup! arm64: kvm: allows kvm cpu hotplug
> 
> ---
>   arch/arm/kvm/arm.c |   62 +++++++++++++++++++++++++++++++++++-----------------
>   1 file changed, 42 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index 518c3c7..05eaa35 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -573,7 +573,13 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>   		/*
>   		 * Re-check atomic conditions
>   		 */
> -		if (signal_pending(current)) {
> +		if (!__this_cpu_read(kvm_arm_hardware_enabled)) {

if (unlikely(...))
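
The kernel defines unlikely() in terms of __builtin_expect(), so the hint
only affects code layout, not behaviour. A minimal userspace sketch of the
re-check with the hint applied is given here for illustration. It is not
kernel code: the model_* names and the enum are hypothetical stand-ins for
the KVM exit reasons, the "1 means keep going" return convention is an
assumption of the model, and the macro assumes GCC or Clang.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel macro of the same name (GCC/Clang). */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-ins for KVM_EXIT_INTR and KVM_EXIT_FAIL_ENTRY. */
enum model_exit { MODEL_EXIT_UNKNOWN, MODEL_EXIT_INTR, MODEL_EXIT_FAIL_ENTRY };

struct model_run {
	enum model_exit exit_reason;
	uint64_t hardware_entry_failure_reason;
};

/*
 * Models the "re-check atomic conditions" step. In this model a return
 * value of 1 means the guest may be entered; 0 or a negative value means
 * the run loop should stop and report to userspace.
 */
int model_recheck(bool hw_enabled, bool sig_pending, struct model_run *run)
{
	int ret = 1;

	if (unlikely(!hw_enabled)) {
		/* cpu has been torn down: report a failed entry, not an error */
		ret = 0;
		run->exit_reason = MODEL_EXIT_FAIL_ENTRY;
		run->hardware_entry_failure_reason = (uint64_t)-ENOEXEC;
	} else if (sig_pending) {
		ret = -EINTR;
		run->exit_reason = MODEL_EXIT_INTR;
	}

	return ret;
}

/* Exercises the three paths; returns 0 if every check holds. */
int run_recheck_demo(void)
{
	struct model_run run = { MODEL_EXIT_UNKNOWN, 0 };

	assert(model_recheck(true, false, &run) == 1);
	assert(run.exit_reason == MODEL_EXIT_UNKNOWN);

	assert(model_recheck(true, true, &run) == -EINTR);
	assert(run.exit_reason == MODEL_EXIT_INTR);

	/* The torn-down check takes priority over a pending signal. */
	assert(model_recheck(false, true, &run) == 0);
	assert(run.exit_reason == MODEL_EXIT_FAIL_ENTRY);
	assert(run.hardware_entry_failure_reason == (uint64_t)-ENOEXEC);

	return 0;
}
```

Note that the torn-down check comes first, so a CPU that has lost its HYP
state reports a failed entry even if a signal is also pending.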

> +			/* cpu has been torn down */
> +			ret = 0;
> +			run->exit_reason = KVM_EXIT_FAIL_ENTRY;
> +			run->fail_entry.hardware_entry_failure_reason
> +					= (u64)-ENOEXEC;
> +		} else if (signal_pending(current)) {
>   			ret = -EINTR;
>   			run->exit_reason = KVM_EXIT_INTR;
>   		}
> @@ -950,7 +956,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
>   	}
>   }
> 
> -int kvm_arch_hardware_enable(void)
> +static void cpu_init_hyp_mode(void)
>   {
>   	phys_addr_t boot_pgd_ptr;
>   	phys_addr_t pgd_ptr;
> @@ -958,9 +964,6 @@ int kvm_arch_hardware_enable(void)
>   	unsigned long stack_page;
>   	unsigned long vector_ptr;
> 
> -	if (__hyp_get_vectors() != hyp_default_vectors)
> -		return 0;
> -
>   	/* Switch from the HYP stub to our own HYP init vector */
>   	__hyp_set_vectors(kvm_get_idmap_vector());
> 
> @@ -973,24 +976,38 @@ int kvm_arch_hardware_enable(void)
>   	__cpu_init_hyp_mode(boot_pgd_ptr, pgd_ptr, hyp_stack_ptr, vector_ptr);
> 
>   	kvm_arm_init_debug();
> -
> -	return 0;
>   }
> 
> -void kvm_arch_hardware_disable(void)
> +static void cpu_reset_hyp_mode(void)
>   {
>   	phys_addr_t boot_pgd_ptr;
>   	phys_addr_t phys_idmap_start;
> 
> -	if (__hyp_get_vectors() == hyp_default_vectors)
> -		return;
> -
>   	boot_pgd_ptr = kvm_mmu_get_boot_httbr();
>   	phys_idmap_start = kvm_get_idmap_start();
> 
>   	__cpu_reset_hyp_mode(boot_pgd_ptr, phys_idmap_start);
>   }
> 
> +int kvm_arch_hardware_enable(void)
> +{
> +	if (!__this_cpu_read(kvm_arm_hardware_enabled)) {
> +		cpu_init_hyp_mode();
> +		__this_cpu_write(kvm_arm_hardware_enabled, 1);
> +	}
> +
> +	return 0;
> +}
> +
> +void kvm_arch_hardware_disable(void)
> +{
> +	if (!__this_cpu_read(kvm_arm_hardware_enabled))
> +		return;
> +
> +	cpu_reset_hyp_mode();
> +	__this_cpu_write(kvm_arm_hardware_enabled, 0);
> +}

Are you guaranteed that these two functions are called from a context
where both preemption and interrupts are disabled? If not, you cannot
use the __this_cpu* operations, and you must stick with this_cpu*.

> +
>   #ifdef CONFIG_CPU_PM
>   static int hyp_init_cpu_pm_notifier(struct notifier_block *self,
>   				    unsigned long cmd,
> @@ -998,19 +1015,13 @@ static int hyp_init_cpu_pm_notifier(struct notifier_block *self,
>   {
>   	switch (cmd) {
>   	case CPU_PM_ENTER:
> -		if (__hyp_get_vectors() != hyp_default_vectors)
> -			__this_cpu_write(kvm_arm_hardware_enabled, 1);
> -		else
> -			__this_cpu_write(kvm_arm_hardware_enabled, 0);
> -		/*
> -		 * don't call kvm_arch_hardware_disable() in case of
> -		 * CPU_PM_ENTER because it does't actually save any state.
> -		 */
> +		if (__this_cpu_read(kvm_arm_hardware_enabled))
> +			cpu_reset_hyp_mode();
> 
>   		return NOTIFY_OK;
>   	case CPU_PM_EXIT:
>   		if (__this_cpu_read(kvm_arm_hardware_enabled))
> -			kvm_arch_hardware_enable();
> +			cpu_init_hyp_mode();

It would be good to have a comment as to why we're not directly calling
hardware_enable/disable.
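
One reason can be read off the patch itself: with the flag-guarded wrappers,
calling kvm_arch_hardware_disable() on CPU_PM_ENTER would clear
kvm_arm_hardware_enabled, so the CPU_PM_EXIT branch would see a clear flag
and restore nothing. The single-threaded userspace model below sketches that
reading. All model_* and naive_* names are hypothetical, the counters stand in
for cpu_init_hyp_mode()/cpu_reset_hyp_mode(), and the model says nothing
about the preemption question raised above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one CPU. */
struct cpu_model {
	bool hw_enabled; /* models the per-cpu kvm_arm_hardware_enabled flag */
	bool el2_live;   /* true while the KVM HYP state is installed */
	int inits;       /* how often the init helper ran */
	int resets;      /* how often the reset helper ran */
};

static void model_init_hyp(struct cpu_model *c)
{
	c->el2_live = true;
	c->inits++;
}

static void model_reset_hyp(struct cpu_model *c)
{
	c->el2_live = false;
	c->resets++;
}

/* Flag-guarded wrappers, shaped like the ones in the fixup patch. */
static void model_hardware_enable(struct cpu_model *c)
{
	if (!c->hw_enabled) {
		model_init_hyp(c);
		c->hw_enabled = true;
	}
}

static void model_hardware_disable(struct cpu_model *c)
{
	if (!c->hw_enabled)
		return;
	model_reset_hyp(c);
	c->hw_enabled = false;
}

/* PM notifier as in the patch: the flag is left untouched. */
static void model_pm_enter(struct cpu_model *c)
{
	if (c->hw_enabled)
		model_reset_hyp(c);
}

static void model_pm_exit(struct cpu_model *c)
{
	if (c->hw_enabled)
		model_init_hyp(c);
}

/* Variant that goes through the wrappers instead. */
static void naive_pm_enter(struct cpu_model *c)
{
	if (c->hw_enabled)
		model_hardware_disable(c); /* also clears the flag */
}

static void naive_pm_exit(struct cpu_model *c)
{
	if (c->hw_enabled)
		model_hardware_enable(c); /* never reached after naive_pm_enter() */
}

/* Runs one PM cycle through each variant; returns 0 if every check holds. */
int run_pm_demo(void)
{
	struct cpu_model a = { false, false, 0, 0 };
	struct cpu_model b = { false, false, 0, 0 };

	model_hardware_enable(&a);
	model_hardware_enable(&a); /* second call is a no-op */
	assert(a.inits == 1);

	model_pm_enter(&a);
	assert(a.hw_enabled && !a.el2_live);
	model_pm_exit(&a);
	assert(a.hw_enabled && a.el2_live && a.inits == 2 && a.resets == 1);

	model_hardware_enable(&b);
	naive_pm_enter(&b);
	naive_pm_exit(&b);
	assert(!b.hw_enabled && !b.el2_live && b.inits == 1);

	return 0;
}
```

In the model, the direct-helper variant comes back from the PM cycle with
the HYP state restored, while the wrapper variant leaves the CPU torn down.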

> 
>   		return NOTIFY_OK;
> 
> @@ -1114,9 +1125,20 @@ static int init_hyp_mode(void)
>   	}
> 
>   	/*
> +	 * Init this CPU temporarily to execute kvm_hyp_call()
> +	 * during kvm_vgic_hyp_init().
> +	 */
> +	preempt_disable();
> +	cpu_init_hyp_mode();
> +
> +	/*
>   	 * Init HYP view of VGIC
>   	 */
>   	err = kvm_vgic_hyp_init();
> +
> +	cpu_reset_hyp_mode();
> +	preempt_enable();
> +
>   	if (err)
>   		goto out_free_context;
> 

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...