On Mon, Jan 28, 2013 at 5:56 PM, Andrey Korolyov <andrey@xxxxxxx> wrote:
> On Mon, Jan 28, 2013 at 3:14 AM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
>> On Mon, Jan 28, 2013 at 12:04:50AM +0300, Andrey Korolyov wrote:
>>> On Sat, Jan 26, 2013 at 12:49 AM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
>>> > On Fri, Jan 25, 2013 at 10:45:02AM +0300, Andrey Korolyov wrote:
>>> >> On Thu, Jan 24, 2013 at 4:20 PM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
>>> >> > On Thu, Jan 24, 2013 at 01:54:03PM +0300, Andrey Korolyov wrote:
>>> >> >> Thank you Marcelo,
>>> >> >>
>>> >> >> The host node is locking up somewhat later than yesterday, but the
>>> >> >> problem is still here; please see the attached dmesg. The stuck
>>> >> >> process looks like
>>> >> >>
>>> >> >> root 19251 0.0 0.0 228476 12488 ? D 14:42 0:00
>>> >> >> /usr/bin/kvm -no-user-config -device ? -device pci-assign,? -device
>>> >> >> virtio-blk-pci,? -device
>>> >> >>
>>> >> >> and it is the fourth VM by count.
>>> >> >>
>>> >> >> Should I try an upstream kernel instead of applying the patch to
>>> >> >> the latest 3.4, or is that useless?
>>> >> >
>>> >> > If you can upgrade to an upstream kernel, please do that.
>>> >> >
>>> >>
>>> >> With vanilla 3.7.4 there is almost no change, and the NMI started
>>> >> firing again. The external symptoms look like the following: starting
>>> >> from some VM count, maybe the third or the sixth, the qemu-kvm
>>> >> process allocates its memory very slowly and in jumps,
>>> >> 20M-200M-700M-1.6G over minutes. The patch helps, of course: on both
>>> >> patched 3.4 and vanilla 3.7 I am able to kill the stuck kvm processes
>>> >> and the node returns to normal, whereas on 3.2 sending SIGKILL to the
>>> >> process produces zombies and hung ``ps'' output (the problem, and a
>>> >> workaround for the case where no scheduler is involved, is described
>>> >> here: http://www.spinics.net/lists/kvm/msg84799.html).
>>> >
>>> > Try disabling pause loop exiting with the ple_gap=0 kvm-intel.ko
>>> > module parameter.
>>> >
>>>
>>> Hi Marcelo,
>>>
>>> thanks, this parameter helped to increase the number of working VMs by
>>> half an order of magnitude, from 3-4 to 10-15. A very high system (SY)
>>> load, 10 to 15 percent, persists at those counts for a long time,
>>> whereas Linux guests in the same configuration do not jump over one
>>> percent even under a stress benchmark. After I disabled HT, the crash
>>> happens only on long runs, and now it is a kernel panic :)
>>> The stair-like memory allocation behaviour disappeared, but another
>>> symptom leading to the crash, which I had not noticed previously,
>>> persists: if the VM count is ``enough'' for a crash, some qemu
>>> processes start to eat one core each, and they will panic the system
>>> after tens of minutes of running in that state, or as soon as I try to
>>> attach a debugger to one of them. If needed, I can log the entire
>>> crash output via netconsole; for now I have some tail of it, almost
>>> the same every time:
>>> http://xdel.ru/downloads/btwin.png
>>
>> Yes, please log the entire crash output, thanks.
>>
>
> Here it is: 3.7.4-vanilla, 16 VMs, ple_gap=0:
>
> http://xdel.ru/downloads/oops-default-kvmintel.txt

Just an update: I was able to reproduce this on pure Linux VMs using
qemu-1.3.0 with the ``stress'' benchmark running inside them; the panic
occurs at the start of a VM (with ten machines already working at that
moment).
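For reference, the guests were loaded with something along these lines
(a sketch only; the worker counts and duration are assumptions, not the
exact invocation):

    # spin up CPU, I/O and memory workers inside each guest for 10 minutes
    stress --cpu 4 --io 2 --vm 2 --vm-bytes 256M --timeout 600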
Qemu-1.1.2 generally is not able to reproduce this, but the host node
with the older version crashes with a smaller number of Windows VMs
(three to six instead of ten to fifteen) than with 1.3; please see the
trace below:

http://xdel.ru/downloads/oops-old-qemu.txt
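For anyone who wants to retest with pause loop exiting disabled, a
minimal sketch of the ple_gap=0 workaround discussed above (the
modprobe.d file name is an assumption about the local setup, and all
VMs have to be stopped before the module can be unloaded):

    # reload kvm-intel with pause loop exiting disabled
    modprobe -r kvm-intel
    modprobe kvm-intel ple_gap=0
    # or make the setting persistent across reboots (file name is a
    # local convention, adjust for your distribution)
    echo "options kvm-intel ple_gap=0" >> /etc/modprobe.d/kvm-intel.conf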
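The crash output above was captured over netconsole, roughly as follows
(the IP addresses, interface name and target MAC are placeholders for
my setup, not something to copy verbatim):

    # stream kernel messages to a remote host over UDP
    modprobe netconsole netconsole=6666@10.0.0.2/eth0,514@10.0.0.1/00:11:22:33:44:55
    # raise the console loglevel so the whole oops goes out
    dmesg -n 8
    # on the receiving host, listen with e.g. netcat: nc -u -l 514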