Re: [PATCH v3] KVM: s390: fix gisa destroy operation might lead to cpu stalls

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 8/28/23 11:15 AM, Michael Mueller wrote:
> A GISA cannot be destroyed as long it is linked in the GIB alert list
> as this would break the alert list. Just waiting for its removal from
> the list triggered by another vm is not sufficient as it might be the
> only vm. The below shown cpu stall situation might occur when GIB alerts
> are delayed and is fixed by calling process_gib_alert_list() instead of
> waiting.
> 
> At this time the vcpus of the vm are already destroyed and thus
> no vcpu can be kicked to enter the SIE again if for some reason an
> interrupt is pending for that vm.
> 
> Additionally the IAM restore value is set to 0x00. That would be a bug
> introduced by incomplete device de-registration, i.e. missing
> kvm_s390_gisc_unregister() call.
> 
> Setting this value and the IAM in the GISA to 0x00 guarantees that late
> interrupts don't bring the GISA back into the alert list.
> 
> CPU stall caused by kvm_s390_gisa_destroy():
> 
>  [ 4915.311372] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 14-.... } 24533 jiffies s: 5269 root: 0x1/.
>  [ 4915.311390] rcu: blocking rcu_node structures (internal RCU debug): l=1:0-15:0x4000/.
>  [ 4915.311394] Task dump for CPU 14:
>  [ 4915.311395] task:qemu-system-s39 state:R  running task     stack:0     pid:217198 ppid:1      flags:0x00000045
>  [ 4915.311399] Call Trace:
>  [ 4915.311401]  [<0000038003a33a10>] 0x38003a33a10
>  [ 4933.861321] rcu: INFO: rcu_sched self-detected stall on CPU
>  [ 4933.861332] rcu: 	14-....: (42008 ticks this GP) idle=53f4/1/0x4000000000000000 softirq=61530/61530 fqs=14031
>  [ 4933.861353] rcu: 	(t=42008 jiffies g=238109 q=100360 ncpus=18)
>  [ 4933.861357] CPU: 14 PID: 217198 Comm: qemu-system-s39 Not tainted 6.5.0-20230816.rc6.git26.a9d17c5d8813.300.fc38.s390x #1
>  [ 4933.861360] Hardware name: IBM 8561 T01 703 (LPAR)
>  [ 4933.861361] Krnl PSW : 0704e00180000000 000003ff804bfc66 (kvm_s390_gisa_destroy+0x3e/0xe0 [kvm])
>  [ 4933.861414]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
>  [ 4933.861416] Krnl GPRS: 0000000000000000 00000372000000fc 00000002134f8000 000000000d5f5900
>  [ 4933.861419]            00000002f5ea1d18 00000002f5ea1d18 0000000000000000 0000000000000000
>  [ 4933.861420]            00000002134fa890 00000002134f8958 000000000d5f5900 00000002134f8000
>  [ 4933.861422]            000003ffa06acf98 000003ffa06858b0 0000038003a33c20 0000038003a33bc8
>  [ 4933.861430] Krnl Code: 000003ff804bfc58: ec66002b007e	cij	%r6,0,6,000003ff804bfcae
>                            000003ff804bfc5e: b904003a		lgr	%r3,%r10
>                           #000003ff804bfc62: a7f40005		brc	15,000003ff804bfc6c
>                           >000003ff804bfc66: e330b7300204	lg	%r3,10032(%r11)
>                            000003ff804bfc6c: 58003000		l	%r0,0(%r3)
>                            000003ff804bfc70: ec03fffb6076	crj	%r0,%r3,6,000003ff804bfc66
>                            000003ff804bfc76: e320b7600271	lay	%r2,10080(%r11)
>                            000003ff804bfc7c: c0e5fffea339	brasl	%r14,000003ff804942ee
>  [ 4933.861444] Call Trace:
>  [ 4933.861445]  [<000003ff804bfc66>] kvm_s390_gisa_destroy+0x3e/0xe0 [kvm]
>  [ 4933.861460] ([<00000002623523de>] free_unref_page+0xee/0x148)
>  [ 4933.861507]  [<000003ff804aea98>] kvm_arch_destroy_vm+0x50/0x120 [kvm]
>  [ 4933.861521]  [<000003ff8049d374>] kvm_destroy_vm+0x174/0x288 [kvm]
>  [ 4933.861532]  [<000003ff8049d4fe>] kvm_vm_release+0x36/0x48 [kvm]
>  [ 4933.861542]  [<00000002623cd04a>] __fput+0xea/0x2a8
>  [ 4933.861547]  [<00000002620d5bf8>] task_work_run+0x88/0xf0
>  [ 4933.861551]  [<00000002620b0aa6>] do_exit+0x2c6/0x528
>  [ 4933.861556]  [<00000002620b0f00>] do_group_exit+0x40/0xb8
>  [ 4933.861557]  [<00000002620b0fa6>] __s390x_sys_exit_group+0x2e/0x30
>  [ 4933.861559]  [<0000000262d481f4>] __do_syscall+0x1d4/0x200
>  [ 4933.861563]  [<0000000262d59028>] system_call+0x70/0x98
>  [ 4933.861565] Last Breaking-Event-Address:
>  [ 4933.861566]  [<0000038003a33b60>] 0x38003a33b60
> 
> Fixes: 9f30f6216378 ("KVM: s390: add gib_alert_irq_handler()")
> Signed-off-by: Michael Mueller <mimu@xxxxxxxxxxxxx>
> ---
>  arch/s390/kvm/interrupt.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
> index 85e39f472bb4..75e200bd1030 100644
> --- a/arch/s390/kvm/interrupt.c
> +++ b/arch/s390/kvm/interrupt.c
> @@ -3216,11 +3216,12 @@ void kvm_s390_gisa_destroy(struct kvm *kvm)
>  
>  	if (!gi->origin)
>  		return;
> -	if (gi->alert.mask)
> -		KVM_EVENT(3, "vm 0x%pK has unexpected iam 0x%02x",
> -			  kvm, gi->alert.mask);
> -	while (gisa_in_alert_list(gi->origin))
> -		cpu_relax();
> +	WARN(gi->alert.mask != 0x00,
> +	     "unexpected non zero alert.mask 0x%02x",
> +	     gi->alert.mask);
> +	gi->alert.mask = 0x00;
> +	if (gisa_set_iam(gi->origin, gi->alert.mask))
> +		process_gib_alert_list();
>  	hrtimer_cancel(&gi->timer);

Thanks for the prior explanations.  This looks pretty good to me now, I think the subtlety that I was missing is that we are kicking off the callback (gisa_vcpu_kicker) via hrtimer_start with an immediate expiry (0) and relying on the fact that this hrtimer_cancel here will wait until that callback has finished.  AFAIU that means that now we will either set the IAM immediately here via gisa_set_iam or via the callback after handling the alert; in both cases this will prevent further alerts and we won't clear gi->origin until after that point.

Reviewed-by: Matthew Rosato <mjrosato@xxxxxxxxxxxxx>


>  	gi->origin = NULL;
>  	VM_EVENT(kvm, 3, "gisa 0x%pK destroyed", gisa);




[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux