Re: 2.6.35-rc1 regression with pvclock and smp guests

Arjan Koers <0h61vkll2ly8@xxxxxxxxxx> · Sat, 31 Jul 2010 13:53:45 +0200

On 2010-07-31 04:39, Zachary Amsden wrote:
> On 07/30/2010 02:34 PM, Arjan Koers wrote:
>> On 2010-07-28 12:37, Avi Kivity wrote:
>>   
>>>   On 07/28/2010 12:00 AM, Arjan Koers wrote:
>>>     
>>>> On 2010-07-26 20:59, Arjan Koers wrote:
>>>>
>>>>       
>>>>> I ran into the same problem. 2.6.34.1 and 2.6.35-rc6 SMP guest
>>>>> kernels hang during boot.
>>>>>          
>>>> It appears that last is way ahead of ret twice.
>>>> The kernel boots with this debug patch that makes the clock go
>>>> backwards if the difference is big:
>>>>
>>>>        last = atomic64_read(&last_value);
>>>>        do {
>>>> -        if (ret<   last)
>>>> -            return last;
>>>> +        if (ret<   last) {
>>>> +            if ( last - ret<   25000000 )
>>>> +                return last;
>>>> +            else
>>>> +                printk("pvclock backwards: ret = %llx; last =
>>>> %llx\n", ret, last);
>>>> +        }
>>>>            last = atomic64_cmpxchg(&last_value, last, ret);
>>>>        } while (unlikely(last != ret));
>>>>
>>>>
>>>>
>>>> [    0.037122] Total of 2 processors activated (11198.08 BogoMIPS).
>>>> [    0.037118] x86 PAT enabled: cpu 1, old 0x0, new 0x7010600070106
>>>> [    0.040000] pvclock backwards: ret = 108373705fe2; last =
>>>> 210aff61470a
>>>>        
>>> Zaaaacchhhh?!
>>>
>>>      
>>
>> The lists below show some debug data of the first 99 calls to
>> pvclock_clocksource_read since the kernel booted. The situation
>> after the 'do ... while (version != src->version)' loop is
>> displayed.
>>
>> Meaning of the columns:
>> - src pointer
>> - shadow.tsc_timestamp
>> - shadow.system_timestamp
>> - shadow.version
>> - native_read_tsc()
>> - delta = native_read_tsc() - shadow.tsc_timestamp
>> - offset = scale_delta(delta, shadow.tsc_to_nsec_mul, shadow.tsc_shift)
>> - ret = shadow.system_timestamp + offset
>>
>> Fields left out, because they were the same for all rows:
>> - shadow.tsc_to_nsec_mul: b6dc43b6
>> - shadow.tsc_shift: ffffffff
>> - shadow.flags: 0
>>
>> Debug log of guest after cold boot of virtual machine:
<snip>
>> 70: ffff880001411c00     232f4a54e b42c0e25f5e7   c17c    
>> 232f63f61         19a13          9274  b42c0e26885b
>> 71: ffff880001511c00     20afec946 b42bffe0b604    130 1f890681eacdf
>> 1f88e5d1fe399  b433ab005565 1685faae10b69
>>    
> 
> Okay, I think I know what's going on and why Glauber's patch causes
> problems for you.  It looks like your kernel is reading the kvmclock on
> the AP before it is initialized.  Looking at the guest side of things,
> it seems entirely plausible this could happen.
> 
> You did mention printk timing causes the bug to appear?  Perhaps it is
> not just coincidental.  Printk getting the time might very well call
> back into the timer code before the clock is initialized, and you've got
> tons of stuff in cpu_init and friends that are likely to want to printk
> all kinds of bootup messages.
> 
> If this were in fact the case, the cmpxchg that was added by Glauber's
> patch could leap your clock forward to some very uninitialized random
> value and then you could end up stuck in a timeout loop for days, as you
> are seeing.

Yes. That large wrong value is stored in last_value and all future correct
values are ignored, because they are smaller then last_value.

> Can you try very simply disabling printk timing to see if that might be
> the source of the bug?  In the meantime, what kernel version do you have
> in the guests?

The kernel boots successfully when CONFIG_PRINTK_TIME is not set.

I'm testing with 2.6.35-rc6 now. The problem also occurs with 2.6.34.1,
which also has Glauber's patch. Version 2.6.34 is working.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html