Re: Another preempt folding issue?

Stefan Bader <stefan.bader@xxxxxxxxxxxxx> · Wed, 12 Feb 2014 09:20:24 +0100

On 11.02.2014 20:45, Peter Zijlstra wrote:
> On Tue, Feb 11, 2014 at 07:34:51PM +0100, Stefan Bader wrote:
>> Hi Peter,
>>
>> I am currently looking at a weird issue that manifest itself when trying to run
>> kvm enabled qemu on a i386 host (v3.13 kernel, oh and potentially important the
>> cpu is 64bit capable, so qemu-system-x86_64 is called). 
> 
> AMD or Intel machine?

Personally I am reproducing this on an AMD:

cpu family	: 15
model		: 72
model name	: AMD Turion(tm) 64 X2 Mobile Technology TL-64

but the reports I got are showing the same issue on an Intel i7

cpu family	: 6
model		: 37
model name	: Intel(R) Core(TM) i7 CPU       M 620  @ 2.67GH

Another reporter also saw this on an AMD and said it could not be reproduced on
the same hardware and the same software versions when using 64bit instead of 32.

In my case on a 32bit installation I will see this on every invocation of

qemu-system-x86_64 -cdrom <32bit live iso> -boot d -m 512

> 
>> Sooner or later this
>> causes softlockup messages on the host. I tracked this down to __vcpu_run in
>> arch/x86/kvm/x86.c which does a loop which in that case never seems to make
>> progress or exit.
>>
>> What I found is that vcpu_enter_guest will exit quickly without causing the loop
>> to exit when need_resched() is true. Looking at a crash dump I took, this was
>> the case (thread_info->flags had TIF_NEED_RESCHED set). So after immediately
>> returning __vcpu_run has the following code:
>>
>>         if (need_resched()) {
>>                 srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
>>                 kvm_resched(vcpu); // now cond_resched();
>>                 vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>>         }
>>
>> The kvm_resched basically would end up doing a cond_resched() which now checks
>> preempt_count() to be 0. If that is zero it will do the reschedule, otherwise it
>> just does nothing. Looking at the percpu variables in the dump, I saw that
>> the preempt_count was 0x8000000 (actually it was 0x80110000 but that was me
>> triggering the kexec crashdump with sysrq-c).
>>
>> I saw that there have been some changes in the upstream kernel and have picked
>> the following:
>> 1) x86, acpi, idle: Restructure the mwait idle routines
>> 2) x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers
>> 3) sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
>> 4) sched/preempt/x86: Fix voluntary preempt for x86
>>
>> Patch 1) and 2) as dependencies of 3) (to get the mwait function correct and to
>> the other file). Finally 4) is fixing up 3). [maybe worth suggesting to do for
>> 3.13.y stable].
>>
>> Still, with all those I got the softlockup. Since I knew from the dump info that
>> something is wrong with the folding, I made the pragmatic approach and added the
>> following:
>>
>>         if (need_resched()) {
>>                 srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
>> +               preempt_fold_need_resched();
>>                 kvm_resched(vcpu); // now cond_resched();
>>                 vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>>         }
>>
>> And this lets the kvm guest run without the softlockups! However I am less than
>> convinced that this is the right thing to do. Somehow something done when
>> converting the preempt_count into percpu has caused at least the i386 side to
>> get into this mess (as there has not been any whining about 64bit). Just fail to
>> see what.
> 
> I've been looking at the same thing too; I've got a trace from someone
> who can reproduce and its just not making sense.
> 
> Looks like the hardware is loosing an interrupt (or worse).
> 
> With the below patch on top of current -git (might be whitespace damaged
> due to copy-paste).
> 
> ---
>  arch/x86/kvm/x86.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 39c28f09dfd5..f8b2fc7ce491 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6115,7 +6115,11 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
>                 }
>                 if (need_resched()) {
>                         srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> -                       cond_resched();
> +                       if (need_resched() != should_resched()) {
> +                               tracing_stop();
> +                               printk(KERN_ERR "Stopped tracing, due to inconsistent state.\n");
> +                       }
> +                       schedule();
>                         vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>                 }
>         }
> 
> ---
> 
> I got the following from an Athlon X2 (fam 0xf):
> 
> On Mon, Feb 10, 2014 at 05:23:15PM -0200, Marcelo Tosatti wrote:
>> What is the problem exactly?
> 
> From the trace:
> 
>  qemu-system-x86-2455  [001] d.s3   393.630586: smp_reschedule_interrupt <-reschedule_interrupt
>  qemu-system-x86-2455  [001] d.s3   393.630586: scheduler_ipi <-smp_reschedule_interrupt
>  qemu-system-x86-2455  [001] ..s3   393.630586: preempt_count_sub <-_raw_spin_unlock_irqrestore
>  qemu-system-x86-2455  [001] ..s2   393.630586: _raw_spin_lock_irq <-run_timer_softirq
>  qemu-system-x86-2455  [001] d.s2   393.630586: preempt_count_add <-_raw_spin_lock_irq
>  qemu-system-x86-2455  [001] d.s3   393.630586: _raw_spin_unlock_irq <-run_timer_softirq
>  qemu-system-x86-2455  [001] ..s3   393.630586: preempt_count_sub <-_raw_spin_unlock_irq
>  qemu-system-x86-2455  [001] ..s2   393.630586: rcu_bh_qs <-__do_softirq
>  qemu-system-x86-2455  [001] d.s2   393.630586: irqtime_account_irq <-__do_softirq
>  qemu-system-x86-2455  [001] d.s2   393.630586: __local_bh_enable <-__do_softirq
>  qemu-system-x86-2455  [001] d.s2   393.630586: preempt_count_sub <-__local_bh_enable
> 
> 
>            dmesg-2458  [000] d.s5   393.630614: resched_task <-check_preempt_wakeup
>            dmesg-2458  [000] d.s5   393.630614: native_smp_send_reschedule <-resched_task
>            dmesg-2458  [000] d.s5   393.630614: default_send_IPI_mask_logical <-native_smp_send_reschedule
> 
> 
>  qemu-system-x86-2455  [001] .n..   393.630636: preempt_count_add <-kmap_atomic_prot
>  qemu-system-x86-2455  [001] .n.1   393.630639: __kunmap_atomic <-clear_huge_page
>  qemu-system-x86-2455  [001] .n.1   393.630639: preempt_count_sub <-__kunmap_atomic
>  qemu-system-x86-2455  [001] .n..   393.630639: _cond_resched <-clear_huge_page
>  qemu-system-x86-2455  [001] .n..   393.630640: kmap_atomic <-clear_huge_page
>  qemu-system-x86-2455  [001] .n..   393.630640: kmap_atomic_prot <-kmap_atomic
>  qemu-system-x86-2455  [001] .n..   393.630640: preempt_count_add <-kmap_atomic_prot
> 
> 
> The resched_task() in the middle does:
> 
>   set_tsk_need_resched();
>   smp_mb();
>   smp_send_reschedule();
> 
> This means that the remote cpu must observe TIF_NEED_RESCHED (.n.. in
> the traces) when the IPI (smp_reschedule_interrupt) hits.
> 
> Now given the machine has unsynchronized TSC the order in the above is
> not necessarily the true order, but there's only two possible scenarios:
> 
> 1) its the right order; the smp_reschedule_interrupt() is from a
>  previous resched_task()/rcu kick or any of the other users that sends
>  it.
> 
>  In which case we should get another smp_reschedule_interrupt() _after_
>  the resched_task() which should observe the 'n' and promote it to an
>  'N'.
> 
>  This doesn't happen and we're stuck with the 'n' state..
> 
> 2) its the wrong order and the smp_reschedule_interrupt() actually
>  happened after the resched_task(), but in that case we should have
>  observed the 'n', we did not.
> 
> Now given that 2) would mean a bug in the coherency fabric and 1) would
> mean a lost interrupt, 1 seems like the more likely suspect.
> 
> Joerg wanted the extra kvm tracepoints to see what the vm-exit reason
> is.
> 

Let me think about this after I had another cup of coffee and my brain decides
to follow my body in the process of waking up...

-Stefan

Attachment:
signature.asc

Description: OpenPGP digital signature