[arch/x86][tsc_sync] The VM is stuck when we set a hot-added vCPU online.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



We found that when we bring a hot-added vCPU online on a Guest of Ubuntu server 17.10 64bit, which is running on the KVM platform, this Guest will be stuck.

logs were as follows:
...
CPU2 has been hot-added
CPU3 has been hot-added
smpboot: Booting Node 0 Processor 2 APIC 0x2
kvm-clock: cpu 2, msr 4:3ff55081, secondary cpu clock
TSC ADJUST compensate: CPU2 observed 249396676850 warp. Adjust: 249396676850
TSC ADJUST compensate: CPU2 observed 172 warp. Adjust: 249396677022
INFO: rcu_sched self-detected stall on CPU
INFO: rcu_sched detected stalls on CPUs/tasks:
0-...: (1 ticks this GP) idle=93e/140000000000001/0 softirq=11320/11320 fqs=0
         (detected by 1, t=25008 jiffies, g=3837, c=3836, q=4)
Sending NMI from CPU 1 to CPUs 0:
NMI backtrace for cpu 0
CPU: 0 PID: 1762 Comm: bash Not tainted 4.13.0-16-generic #19-Ubuntu
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-20170718_152359-build9a64a246a225 04/01/2014
task: ffff90782caadd00 task.stack: ffffaeab03134000
RIP: 0010:io_serial_out+0x1a/0x20
RSP: 0018:ffff90783fc03c20 EFLAGS: 00000002
RAX: 000000000000005b RBX: ffffffffaa0e4840 RCX: 0000000000000000
RDX: 00000000000003f8 RSI: 0000000000000000 RDI: ffffffffaa0e4840
RBP: ffff90783fc03c20 R08: 0000000000000000 R09: 0000000000000223
R10: 000000000000003a R11: ffffffffa9f3e1ad R12: 000000000000005b
R13: ffffffffa9f3e1da R14: ffffffffa9f3e1a1 R15: ffffffffaa0e4840
FS:  00007f50a0724b80(0000) GS:ffff90783fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffc402cd7d8 CR3: 0000000426079000 CR4: 00000000001406f0
Call Trace:
 <IRQ>
 serial8250_console_putchar+0x27/0x30
 ? wait_for_xmitr+0xa0/0xa0
 uart_console_write+0x48/0x60
 serial8250_console_write+0xfb/0x280
 ? msg_print_text+0x80/0x100
 univ8250_console_write+0x28/0x30
 console_unlock+0x36c/0x4d0
 vprintk_emit+0x333/0x3a0
 vprintk_default+0x29/0x50
 vprintk_func+0x27/0x60
 printk+0x52/0x6e
 rcu_check_callbacks+0x499/0x870
 ? account_system_index_time+0x63/0x70
 ? tick_sched_do_timer+0x50/0x50
 update_process_times+0x2f/0x60
 tick_sched_handle+0x26/0x70
 ? tick_sched_do_timer+0x3f/0x50
 tick_sched_timer+0x39/0x80
 __hrtimer_run_queues+0xdd/0x220
 hrtimer_interrupt+0xa0/0x1d0
 local_apic_timer_interrupt+0x38/0x60
 smp_apic_timer_interrupt+0x38/0x50
 apic_timer_interrupt+0x89/0x90
RIP: 0010:native_cpu_up+0x3d5/0x920
RSP: 0018:ffffaeab03137c50 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff90783fc9cd40 RDI: 0000000000000246
RBP: ffffaeab03137c90 R08: 0000000000000001 R09: 0000000000000000
R10: ffff90782f021a68 R11: 00000000000003ec R12: 0000000000000000
R13: 0000000000000002 R14: 0000000000000246 R15: 0000000000000000
 </IRQ>
 ? takedown_cpu+0xf0/0xf0
 bringup_cpu+0x2a/0xe0
 cpuhp_invoke_callback+0x84/0x3b0
 ? zswap_cpu_comp_dead+0x60/0x60
 cpuhp_up_callbacks+0x36/0xc0
 _cpu_up+0x87/0xd0
 do_cpu_up+0x8b/0xb0
 cpu_up+0x13/0x20
 cpu_subsys_online+0x3d/0x90
 device_online+0x4a/0x90
 online_store+0x89/0xa0
 dev_attr_store+0x18/0x30
 sysfs_kf_write+0x37/0x40
 kernfs_fop_write+0x11c/0x1a0
 __vfs_write+0x18/0x40
 vfs_write+0xb1/0x1a0
 SyS_write+0x55/0xc0
 entry_SYSCALL_64_fastpath+0x1e/0xa9
RIP: 0033:0x7f509fe00054
RSP: 002b:00007ffc402cd4b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f509fe00054
RDX: 0000000000000002 RSI: 00000000011f7008 RDI: 0000000000000001
RBP: 0000000000000001 R08: 000000000136b028 R09: 0000000000000001
R10: 0000000000000073 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffc402cda30 R14: 0000000000000000 R15: 0000000000000000
Code: 55 48 89 e5 d3 e6 01 f2 ec 0f b6 c0 5d c3 0f 1f 00 0f 1f 44 00 00 0f b6 8f a1 00 00 00 89 d0 8b 57 08 55 48 89 e5 d3 e6 01 f2 ee <5d> c3 0f 1f 40 00 0f 1f 44 00 00 0f b6 87 a2 00 00 00 55 48 c7 rcu_sched kthread starved for 25008 jiffies! g3837 c3836 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
rcu_sched       S    0     8      2 0x00000000
Call Trace:
 __schedule+0x28b/0x890
 schedule+0x36/0x80
 schedule_timeout+0x182/0x350
 ? call_timer_fn+0x130/0x130
 rcu_gp_kthread+0x561/0x980
 ? rcu_gp_kthread+0x561/0x980
 kthread+0x125/0x140
 ? rcu_note_context_switch+0x150/0x150
 ? kthread_create_on_node+0x70/0x70
 ret_from_fork+0x25/0x30
0-...: (1 ticks this GP) idle=93e/140000000000001/0 softirq=11320/11320 fqs=1
          (t=25008 jiffies g=3837 c=3836 q=272)
NMI backtrace for cpu 0
CPU: 0 PID: 1762 Comm: bash Not tainted 4.13.0-16-generic #19-Ubuntu
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-20170718_152359-build9a64a246a225 04/01/2014
Call Trace:
 <IRQ>
 dump_stack+0x63/0x8b
 nmi_cpu_backtrace+0xca/0xd0
 ? irq_force_complete_move+0x130/0x130
 nmi_trigger_cpumask_backtrace+0xe6/0x120
 arch_trigger_cpumask_backtrace+0x19/0x20
 rcu_dump_cpu_stacks+0xa8/0xe4
 rcu_check_callbacks+0x655/0x870
 ? account_system_index_time+0x63/0x70
 ? tick_sched_do_timer+0x50/0x50
 update_process_times+0x2f/0x60
 tick_sched_handle+0x26/0x70
 ? tick_sched_do_timer+0x3f/0x50
 tick_sched_timer+0x39/0x80
 __hrtimer_run_queues+0xdd/0x220
 hrtimer_interrupt+0xa0/0x1d0
 local_apic_timer_interrupt+0x38/0x60
 smp_apic_timer_interrupt+0x38/0x50
 apic_timer_interrupt+0x89/0x90
RIP: 0010:native_cpu_up+0x3d5/0x920
RSP: 0018:ffffaeab03137c50 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff90783fc9cd40 RDI: 0000000000000246
RBP: ffffaeab03137c90 R08: 0000000000000001 R09: 0000000000000000
R10: ffff90782f021a68 R11: 00000000000003ec R12: 0000000000000000
R13: 0000000000000002 R14: 0000000000000246 R15: 0000000000000000
 </IRQ>
 ? takedown_cpu+0xf0/0xf0
 bringup_cpu+0x2a/0xe0
 cpuhp_invoke_callback+0x84/0x3b0
 ? zswap_cpu_comp_dead+0x60/0x60
 cpuhp_up_callbacks+0x36/0xc0
 _cpu_up+0x87/0xd0
 do_cpu_up+0x8b/0xb0
 cpu_up+0x13/0x20
 cpu_subsys_online+0x3d/0x90
 device_online+0x4a/0x90
 online_store+0x89/0xa0
 dev_attr_store+0x18/0x30
 sysfs_kf_write+0x37/0x40
 kernfs_fop_write+0x11c/0x1a0
 __vfs_write+0x18/0x40
 vfs_write+0xb1/0x1a0
 SyS_write+0x55/0xc0
 entry_SYSCALL_64_fastpath+0x1e/0xa9
RIP: 0033:0x7f509fe00054
RSP: 002b:00007ffc402cd4b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f509fe00054
RDX: 0000000000000002 RSI: 00000000011f7008 RDI: 0000000000000001
RBP: 0000000000000001 R08: 000000000136b028 R09: 0000000000000001
R10: 0000000000000073 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffc402cda30 R14: 0000000000000000 R15: 0000000000000000
clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large: clocksource: 'kvm-clock' wd_now: 332a591e57 wd_last: 1bcddaeb9e mask: ffffffffffffffff clocksource: 'tsc' cs_now: 4597aa80ab cs_last: 45637665bb mask: ffffffffffffffff
tsc: Marking TSC unstable due to clocksource watchdog
KVM setup async PF for cpu 2
kvm-stealtime: cpu 2, msr 43fc8d8c0
Will online and init hotplugged CPU: 2

We think that the following patch introduced this problem:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=855615eee9b1989cac8ec5eaae4562db081a239b

And we noticed the following error message in the VM:
[Firmware Bug]: TSC_DEADLINE disabled due to Errata; please update microcode to version: 0x3a (or later)

So, we also tried testing on a host with higher firmware versions.The VM was still stuck.

How can this problem be solved?

Best Regards




[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux