Re: KVM guest crashes

Marcelo Tosatti <mtosatti@xxxxxxxxxx> · Sat, 24 Jan 2009 11:06:01 -0200

On Sat, Jan 24, 2009 at 08:42:06AM +0100, Alexander Graf wrote:
>> rarely now). You can use the no_timer_check kernel option to bypass  
>> it.
>
> Ok :-). Thanks. The logic in the kernel for this is really stupid  
> (basing timing on clock speed). What about disabling the check if we  
> detect KVM?

Yes, this is an option. We've talked about it before, but no patch was
merged. The RHEL5.3 kernel skips those checks when it detects VMware
or KVM hypervisors.

We should understand what is happening to fix the fullvirt/old guest
case. For the in-kernel PIT, I believe there is a bug somewhere, either
in PIT itself or in the interaction with IOAPIC (failure to inject
interrupts for some reason). I started debugging it by constantly
rebooting an SMP guest, but my testbox died. I hope to get back to it
soon.

>> Regarding the corruption problem, I have a few questions:
>>
>> - It is SMP specific (ie both kernel/userspace irqchip fail).
>> 	- which means UP guests are stable with both kernel/user
>> 	  irqchip.
>
> I have not been able to reproduce any of my issues with UP. I have to  
> admit that I only tried UP with in-kernel irqchip.

OK.

>> The "Stuck ??" messages seem to be coming from smpboot.c. So for some
>> reason vcpu's are being reset. Doesn't seem to be a triple fault because
>> in that case all vcpu's would be reset (so yes, the vcpu was really on
>> BIOS code).
>
> Hm. I know that OSX turns off CPUs it doesn't need as an alternative to 
> deep-sleep. Does Linux do that too?

Not that I know of, unless you offline CPUs manually, which does not
seem to be the case.
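
To rule manual offlining out quickly, something like the sketch below
could be run inside the guest. It assumes the guest kernel exposes
/sys/devices/system/cpu/offline (older kernels may not have it); the
helper names parse_cpu_list and offline_cpus are made up for this
example.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel cpulist string such as "1,4-6" into a list of ints."""
    cpus = []
    for chunk in text.strip().split(","):
        if not chunk:
            continue  # an empty file means no CPUs in the list
        if "-" in chunk:
            lo, hi = chunk.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(chunk))
    return cpus


def offline_cpus(sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Return the CPUs the kernel currently reports as offline.

    Assumption: the 'offline' file exists under sysfs_cpu_dir.
    """
    return parse_cpu_list((Path(sysfs_cpu_dir) / "offline").read_text())
```

An empty result from offline_cpus() would suggest nothing was taken
offline by hand.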

>> Suggest the following:
>> - Confirm the problem happens with root on ext3 filesystem (can't you
>>  mount the CIFS and copy the data over to a local guest disk to
>>  simulate similar load?).
>
> I had Stuck ?? messages without networking, but if it helps I can try  
> that too. In the project we're using this for we do things over cifs, so 
> that's why I built the test case around it.

OK. Just trying to reduce the number of variables involved. I'll set up
a machine to run a similar load next week.

>> - Check that the kernel text is not corrupted. Save the "good" kernel
>>  text with QEMU's "pmemsave" or "memsave" (you can see start/end in
>>  the symbols _text/_etext, /proc/kallsyms) after booting. After you
>>  see the crash, save the "bad" kernel text, compare. This can give
>>  additional clues (or not).
>
> Good idea - I'll try.
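
For the comparison itself, a rough sketch along these lines might help.
It is only an illustration: kernel_text_range and diff_ranges are names
invented here, and it assumes you have the guest's /proc/kallsyms
contents and two equally sized dumps taken with "memsave" (or
"pmemsave") over the _text.._etext range.

```python
def kernel_text_range(kallsyms_text):
    """Return (start, end) of kernel text from /proc/kallsyms contents.

    Each kallsyms line looks like "<hex address> <type> <symbol>".
    """
    addrs = {}
    for line in kallsyms_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[2] in ("_text", "_etext"):
            addrs[parts[2]] = int(parts[0], 16)
    return addrs["_text"], addrs["_etext"]


def diff_ranges(good, bad, base=0):
    """List (start, end) address pairs, end exclusive, where dumps differ.

    'good' and 'bad' are bytes objects of equal length; 'base' is the
    address the dumps start at (normally the _text address).
    """
    if len(good) != len(bad):
        raise ValueError("dumps must have the same size")
    ranges = []
    start = None
    for i, (a, b) in enumerate(zip(good, bad)):
        if a != b:
            if start is None:
                start = i  # a new run of differing bytes begins
        elif start is not None:
            ranges.append((base + start, base + i))
            start = None
    if start is not None:
        ranges.append((base + start, base + len(good)))
    return ranges
```

The dump size to save is end - start from kernel_text_range(). The
addresses returned by diff_ranges() can then be looked up against
/proc/kallsyms to see which functions were overwritten.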
>
>> Also, you mentioned "other reports" previously, can you point to them,
>> please?
>
> Yes, will do later. I gotta run now! Thanks for the reply - it's good to 
> know this isn't getting ignored :-).

Have a good weekend.
