Re: windows 2008 guest causing rcu_sched to emit NMI

Andrey Korolyov <andrey@xxxxxxx> · Mon, 28 Jan 2013 00:04:50 +0300

On Sat, Jan 26, 2013 at 12:49 AM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
> On Fri, Jan 25, 2013 at 10:45:02AM +0300, Andrey Korolyov wrote:
>> On Thu, Jan 24, 2013 at 4:20 PM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
>> > On Thu, Jan 24, 2013 at 01:54:03PM +0300, Andrey Korolyov wrote:
>> >> Thank you Marcelo,
>> >>
>> >> The host node now locks up somewhat later than yesterday, but the
>> >> problem is still here; please see the attached dmesg. The stuck
>> >> process looks like
>> >> root     19251  0.0  0.0 228476 12488 ?        D    14:42   0:00
>> >> /usr/bin/kvm -no-user-config -device ? -device pci-assign,? -device
>> >> virtio-blk-pci,? -device
>> >>
>> >> on the fourth VM by count.
>> >>
>> >> Should I try an upstream kernel instead of applying the patch to the
>> >> latest 3.4, or is that useless?
>> >
>> > If you can upgrade to an upstream kernel, please do that.
>> >
>>
>> With vanilla 3.7.4 almost nothing changes, and the NMI started firing
>> again. The external symptoms are as follows: starting from some
>> count, maybe the third or sixth VM, the qemu-kvm process allocates its
>> memory very slowly and in jumps, 20M-200M-700M-1.6G over minutes. The
>> patch helps, of course: on both patched 3.4 and vanilla 3.7 I am able
>> to kill stuck kvm processes and the node returns to normal, whereas on
>> 3.2 sending SIGKILL to the process causes zombies and hung ``ps''
>> output (the problem, and a workaround when no scheduler is involved,
>> are described here http://www.spinics.net/lists/kvm/msg84799.html).
>
> Try disabling pause loop exiting with ple_gap=0 kvm-intel.ko module parameter.
>

Hi Marcelo,

thanks, this parameter increased the number of working VMs by about
half an order of magnitude, from 3-4 to 10-15. A very high SY load, 10
to 15 percent, persists at those numbers for a long time, whereas Linux
guests in the same configuration do not go above one percent even under
a stress benchmark. After I disabled HT, the crash happens only in long
runs, and now it is a kernel panic :)
The stair-like memory allocation behaviour disappeared, but another
symptom leading to the crash, which I had not taken into account
previously, persists: if the VM count is ``enough'' for a crash, some
qemu processes start to eat one core, and they will panic the system
after running for tens of minutes in that state, or if I try to attach
a debugger to one of them. If needed, I can log the entire crash output
via netconsole; for now I have only the tail, which is almost the same
every time:
http://xdel.ru/downloads/btwin.png
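
For anyone reproducing this, it may help to confirm that ple_gap=0
actually took effect after reloading kvm-intel. Below is a minimal
sketch, not something from my setup: it assumes the parameter is
exposed read-only at /sys/module/kvm_intel/parameters/ple_gap, and the
helper names read_ple_gap and describe_ple are made up for
illustration.

```python
from pathlib import Path

# Assumption: kvm_intel exposes ple_gap read-only at this sysfs path.
PLE_GAP_PATH = Path("/sys/module/kvm_intel/parameters/ple_gap")


def read_ple_gap(path=PLE_GAP_PATH):
    """Return ple_gap as an int, or None if it cannot be read or parsed."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Module not loaded, parameter not exposed, or unexpected content.
        return None


def describe_ple(value):
    """Turn a ple_gap value into a short human-readable status string."""
    if value is None:
        return "unknown (kvm_intel not loaded or parameter not exposed)"
    if value == 0:
        return "disabled (ple_gap=0)"
    return f"enabled (ple_gap={value})"


if __name__ == "__main__":
    print("pause loop exiting:", describe_ple(read_ple_gap()))
```

To keep the setting across reboots, a line such as
"options kvm-intel ple_gap=0" in a file under /etc/modprobe.d/ should
work on most distributions, though the exact file name is up to you.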