On Mon, 6 Jul 2015 17:59:10 +0800
zhanghailiang <zhang.zhanghailiang@xxxxxxxxxx> wrote:

> On 2015/7/6 16:45, Paolo Bonzini wrote:
> >
> >
> > On 06/07/2015 09:54, zhanghailiang wrote:
> >>
> >> From the host, we found that the QEMU vcpu1 and vcpu7 threads were not
> >> consuming any CPU (they should be in idle state).
> >> All of the VCPUs' stacks in the host look like below:
> >>
> >> [<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
> >> [<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm]
> >> [<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
> >> [<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
> >> [<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0
> >> [<ffffffff8116c251>] sys_ioctl+0xa1/0xb0
> >> [<ffffffff81468092>] system_call_fastpath+0x16/0x1b
> >> [<00002ab9fe1f99a7>] 0x2ab9fe1f99a7
> >> [<ffffffffffffffff>] 0xffffffffffffffff
> >>
> >> We looked into the kernel code that could lead to the above 'Stuck'
> >> warning,

In current upstream there isn't any printk(...Stuck...) left, since that
code path has been reworked.
I've often seen this on an over-committed host during guest CPU up/down
torture tests.
Could you update the guest kernel to upstream and see if the issue reproduces?

> >> and found that the only possibility is that the emulation of the 'cpuid' instruction in
> >> kvm/qemu has something wrong.
> >> But since we can't reproduce this problem, we are not quite sure.
> >> Is it possible that the cpuid emulation in kvm/qemu has some bug?
> >
> > Can you explain the relationship to the cpuid emulation?  What do the
> > traces say about vcpus 1 and 7?
>
> OK, we searched the VM's kernel code for the 'Stuck' message, and it is located in
> do_boot_cpu(). It runs in BSP context; the call path is:
> BSP executes start_kernel() -> smp_init() -> smp_boot_cpus() -> do_boot_cpu() -> wakeup_secondary_cpu_via_init() to trigger APs.
> It will wait 5s for the APs to start up; if some AP does not start up normally, it will print 'CPU%d Stuck' or 'CPU%d: Not responding'.
>
> If it prints 'Stuck', it means the AP has received the SIPI interrupt and begun executing the code at
> 'ENTRY(trampoline_data)' (trampoline_64.S), but got stuck somewhere before smp_callin() (smpboot.c).
> The following is the startup process of the BSP and the APs.
> BSP:
> start_kernel()
>    ->smp_init()
>       ->smp_boot_cpus()
>         ->do_boot_cpu()
>             ->start_ip = trampoline_address(); // set the address that the AP will start executing at
>             ->wakeup_secondary_cpu_via_init(); // kick the secondary CPU
>             ->for (timeout = 0; timeout < 50000; timeout++)
>                 if (cpumask_test_cpu(cpu, cpu_callin_mask)) break; // check whether the AP has started up
>
> APs:
> ENTRY(trampoline_data) (trampoline_64.S)
>        ->ENTRY(secondary_startup_64) (head_64.S)
>           ->start_secondary() (smpboot.c)
>              ->cpu_init();
>              ->smp_callin();
>                  ->cpumask_set_cpu(cpuid, cpu_callin_mask); ->Note: if the AP gets here, the BSP will not print the error message.
>
> From the above call process, we can be sure that the AP got stuck between trampoline_data and the cpumask_set_cpu() in
> smp_callin(). We looked through this code path carefully, and only found a 'hlt' instruction that could block the process.
> It is located in trampoline_data():
>
> ENTRY(trampoline_data)
>          ...
>
>      call    verify_cpu        # Verify the cpu supports long mode
>      testl   %eax, %eax        # Check for return code
>      jnz no_longmode
>
>          ...
>
> no_longmode:
>      hlt
>      jmp no_longmode
>
> In verify_cpu(), the only sensitive instruction we can find that could cause a VM exit
> from non-root mode is 'cpuid'.
> This is why we suspect that the cpuid emulation in KVM/QEMU is wrong, leading to the failure in verify_cpu.
>
> From the messages in the VM, we know something is wrong with vcpu1 and vcpu7.
> [    5.060042] CPU1: Stuck ??
> [   10.170815] CPU7: Stuck ??
> [   10.171648] Brought up 6 CPUs
>
> Besides, the following is the CPU information obtained from the host.
> 80FF72F5-FF6D-E411-A8C8-000000821800:/home/fsp/hrg # virsh qemu-monitor-command instance-0000000
> * CPU #0: pc=0x00007f64160c683d thread_id=68570
>   CPU #1: pc=0xffffffff810301f1 (halted) thread_id=68573
>   CPU #2: pc=0xffffffff810301e2 (halted) thread_id=68575
>   CPU #3: pc=0xffffffff810301e2 (halted) thread_id=68576
>   CPU #4: pc=0xffffffff810301e2 (halted) thread_id=68577
>   CPU #5: pc=0xffffffff810301e2 (halted) thread_id=68578
>   CPU #6: pc=0xffffffff810301e2 (halted) thread_id=68583
>   CPU #7: pc=0xffffffff810301f1 (halted) thread_id=68584
>
> Oh, I also forgot to mention in the above message that we have bound each vCPU to a different physical CPU on
> the host.
>
> Thanks,
> zhanghailiang
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html