Re: 2.6.35-rc1 regression with pvclock and smp guests

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 10/02/2010 06:10 AM, Arjan Koers wrote:
On 2010-10-02 09:35, Michael Tokarev wrote:
02.10.2010 09:35, Zachary Amsden wrote:
[]
Can you try this patch to see if it helps?  I believe it is also safe
for Xen, but cc'ing to double check.
It makes no visible difference.

For some reason one of my test guests - 2.6.35.6 32bit kernel -
stopped booting completely, always handing at boot somewhere
unless I disable printk.time.  Here's the typical boot messages,
up to the hang:

[    0.000000] Initializing cgroup subsys cpuset
...
[    0.259999] ata1: PATA max MWDMA2 cmd 0x1f0 ctl 0x3f6 bmdma 0xc000 irq 14
[    0.259999] ata2: PATA max MWDMA2 cmd 0x170 ctl 0x376 bmdma 0xc008 irq 15

Note the time - it is constant after switching to kvmclock.
While CPU 1 is booting, pvclock_clocksource_read gets wrong data for that
CPU and returns a value that's far into the future. On subsequent calls, it
keeps returning that bogus 'last' value, because it has been made
to never go backwards in time.

I'm pretty sure that your kernel will boot with this debug patch (for
2.6.35.7). It doesn't fix the problem, but corrects things afterwards.
The patch sets the clock backwards if it detects that the previous
value was far into the future. It also modifies printk to display some
extra information. The DEBUG define was added to get extra calls to
printk's where things can go wrong.



diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 239427c..5eab569 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -120,12 +120,15 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)

  static atomic64_t last_value = ATOMIC64_INIT(0);

+int pvclock_backwards = 0;
+
  cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
  {
  	struct pvclock_shadow_time shadow;
  	unsigned version;
  	cycle_t ret, offset;
  	u64 last;
+	bool backwards;

  	do {
  		version = pvclock_get_time_values(&shadow, src);
@@ -153,13 +156,26 @@ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
  	 * updating at the same time, and one of them could be slightly behind,
  	 * making the assumption that last_value always go forward fail to hold.
  	 */
+	backwards = false;
  	last = atomic64_read(&last_value);
  	do {
-		if (ret<  last)
-			return last;
+		if (ret<  last) {
+			if ( last - ret<  25000000 )
+				return last;
+			else
+				/* The clock will go backwards instead of being stuck at last value for a very long time
+				 * The return value of the previous call to pvclock_clocksource_read was most probably
+				 * very far into te future because the clock for that CPU hadn't been setup yet
+				 */
+				backwards = true;
+		}
  		last = atomic64_cmpxchg(&last_value, last, ret);
  	} while (unlikely(last != ret));

+	/* Increment outside of the while loop, because it always loops twice */
+	if (backwards)
+		pvclock_backwards++;
+
  	return ret;
  }

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 0bf2ece..d6dcd45 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1,3 +1,5 @@
+#define DEBUG
+
  /*
   *	x86 SMP booting functions
   *
diff --git a/kernel/printk.c b/kernel/printk.c
index 444b770..9608bec 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -687,6 +687,8 @@ static inline void printk_delay(void)
  	}
  }

+extern int pvclock_backwards;
+
  asmlinkage int vprintk(const char *fmt, va_list args)
  {
  	int printed_len = 0;
@@ -778,9 +780,13 @@ asmlinkage int vprintk(const char *fmt, va_list args)
  				unsigned long long t;
  				unsigned long nanosec_rem;

+				int pvclock_backwards_prev = pvclock_backwards;
  				t = cpu_clock(printk_cpu);
  				nanosec_rem = do_div(t, 1000000000);
-				tlen = sprintf(tbuf, "[%5lu.%06lu] ",
+				tlen = sprintf(tbuf, "[%d;%d/%d:%5lu.%06lu] ",
+						printk_cpu,
+						pvclock_backwards_prev,
+						pvclock_backwards,
  						(unsigned long) t,
  						nanosec_rem / 1000);




Partial output on my machine, where the clock is set backwards 4 times:
...
[0;0/0:    0.015662] CPU0: AMD Athlon(tm) II X2 240 Processor stepping 02
[0;0/0:    0.124164] ++++++++++++++++++++=_---CPU UP  1
[0;0/0:    0.124193] Booting Node   0, Processors  #1 Ok.
[0;0/0:    0.124602] Setting warm reset code and vector.
[0;0/0:    0.124609] 1.
[0;0/0:    0.124610] 2.
[0;0/0:    0.124611] 3.
[0;0/0:    0.124624] Asserting INIT.
[0;0/0:    0.124634] Waiting for send to finish...
[0;0/0:    0.134508] Deasserting INIT.
[0;0/0:    0.134515] Waiting for send to finish...
[0;0/0:    0.134519] #startup loops: 2.
[0;0/0:    0.134521] Sending STARTUP #1.
[0;0/0:    0.134527] After apic_write.
[1;0/0:    0.008000] CPU#1 (phys ID: 1) waiting for CALLOUT
[0;0/1:    0.134838] Startup point 1.
[0;1/1:    0.134841] Waiting for send to finish...
[0;1/1:    0.135049] Sending STARTUP #2.
[0;1/1:    0.135055] After apic_write.
[0;1/1:    0.135359] Startup point 1.
[0;1/1:    0.135361] Waiting for send to finish...
[0;1/1:    0.135568] After Startup.
[0;1/1:    0.135569] Before Callout 1.
[0;1/1:    0.135571] After Callout 1.
[1;1/1:    0.008000] CALLIN, before setup_local_APIC().
[1;2/2:    0.008000] Stack at about ffff88001f875f44
[0;3/3:    0.136176] CPU1: has booted.
[1;3/3:    0.008000] kvm-clock: cpu 1, msr 0:1511c41, secondary cpu clock
[0;4/4:    0.136199] Brought up 2 CPUs
[0;4/4:    0.136201] Boot done.
[0;4/4:    0.136202] Before bogomips.
[0;4/4:    0.136204] Total of 2 processors activated (11198.56 BogoMIPS).
[0;4/4:    0.136205] Before bogocount - setting activated=1.
[1;4/4:    0.140208] x86 PAT enabled: cpu 1, old 0x0, new 0x7010600070106
[0;4/4:    0.142577] NET: Registered protocol family 16
[0;4/4:    0.144263] PCI: Using configuration type 1 for base access
[0;4/4:    0.144494] PCI: Using configuration type 1 for extended access
[0;4/4:    0.144938] mtrr: your CPUs had inconsistent variable MTRR settings
[0;4/4:    0.144938] mtrr: your CPUs had inconsistent MTRRdefType settings
[0;4/4:    0.144938] mtrr: probably your BIOS does not setup all CPUs.
[0;4/4:    0.148004] mtrr: corrected configuration.
[0;4/4:    0.156040] bio: create slab<bio-0>  at 0
[0;4/4:    0.156602] vgaarb: loaded
[0;4/4:    0.156602] PCI: Probing PCI hardware
[0;4/4:    0.156602] PCI: Probing PCI hardware (bus 00)
[0;4/4:    0.156703] pci 0000:00:01.1: reg 20: [io  0xc000-0xc00f]
[0;4/4:    0.160269] pci 0000:00:01.3: quirk: [io  0xb000-0xb03f] claimed by PIIX4 ACPI
[0;4/4:    0.161055] pci 0000:00:01.3: quirk: [io  0xb100-0xb10f] claimed by PIIX4 SMB
[0;4/4:    0.164064] pci 0000:00:02.0: reg 10: [mem 0xf0000000-0xf1ffffff pref]
[0;4/4:    0.164827] pci 0000:00:02.0: reg 14: [mem 0xf2000000-0xf2000fff]
[0;4/4:    0.169023] pci 0000:00:03.0: reg 10: [io  0xc020-0xc03f]
[0;4/4:    0.170052] pci 0000:00:03.0: reg 14: [mem 0xf2001000-0xf2001fff]
[0;4/4:    0.170381] pci 0000:00:04.0: reg 10: [io  0xc040-0xc05f]
[0;4/4:    0.170765] pci 0000:00:05.0: reg 10: [io  0xc080-0xc0bf]
[0;4/4:    0.171023] pci 0000:00:06.0: reg 10: [io  0xc0c0-0xc0ff]
[0;4/4:    0.172123] vgaarb: device added: PCI:0000:00:02.0,decodes=io+mem,owns=io+mem,locks=none
[0;4/4:    0.172971] pci 0000:00:01.0: PIIX/ICH IRQ router [8086:7000]
[0;4/4:    0.172971] PCI: pci_cache_line_size set to 64 bytes
[0;4/4:    0.172971] reserve RAM buffer: 000000000009bc00 - 000000000009ffff
[0;4/4:    0.172971] reserve RAM buffer: 000000001fffd000 - 000000001fffffff
[0;4/4:    0.176175] Switching to clocksource kvm-clock
[1;4/4:    0.212494] pci_bus 0000:00: resource 0 [io  0x0000-0xffff]
[1;4/4:    0.212500] pci_bus 0000:00: resource 1 [mem 0x00000000-0xffffffffffffffff]
[1;4/4:    0.212828] NET: Registered protocol family 2
[1;4/4:    0.213783] IP route cache hash table entries: 4096 (order: 3, 32768 bytes)
...

Umm... do you guys have this commit? This is supposed to address the issue where the guest keeps resetting the TSC. A guest which does that will break kvmclock. It only happens on SMP, and it's much worse on AMD CPUs...

sound like your scenario.

commit bd59fc8ff95126f27b7a0df1b6cc602aa428812d
Author: Zachary Amsden <zamsden@xxxxxxxxxx>
Date:   Thu Aug 19 22:07:26 2010 -1000

    KVM: x86: Robust TSC compensation

    Make the match of TSC find TSC writes that are close to each other
    instead of perfectly identical; this allows the compensator to also
    work in migration / suspend scenarios.

    Signed-off-by: Zachary Amsden <zamsden@xxxxxxxxxx>
    Signed-off-by: Marcelo Tosatti <mtosatti@xxxxxxxxxx>


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux