Re: KVM ARM: Boot stability issues

Victor Kamensky <victor.kamensky@xxxxxxxxxx> · Sat, 6 Sep 2014 20:00:46 -0700

Hi Christoffer,

Please see inline for a pointer to the traces you asked for.

On 3 September 2014 06:04, Christoffer Dall <christoffer.dall@xxxxxxxxxx> wrote:
> On Wed, Sep 3, 2014 at 2:37 PM, Diana Craciun
> <diana.craciun@xxxxxxxxxxxxx> wrote:
>> On 09/03/2014 11:38 AM, Christoffer Dall wrote:
>>>
>>> On Wed, Sep 3, 2014 at 12:19 AM, Zbigniew Bodek <zbb@xxxxxxxxxxxx> wrote:
>>>>
>>>> 2014-09-02 9:26 GMT+02:00 Diana Craciun <diana.craciun@xxxxxxxxxxxxx>:
>>>>>
>>>>> On 09/01/2014 05:34 PM, Christoffer Dall wrote:
>>>>>>
>>>>>> On Mon, Sep 01, 2014 at 12:34:09PM +0200, Zbigniew Bodek wrote:
>>>>>>>
>>>>>>> 2014-09-01 12:16 GMT+02:00 Christoffer Dall
>>>>>>> <christoffer.dall@xxxxxxxxxx>:
>>>>>>>>
>>>>>>>>
>>>>>>>> On Monday, September 1, 2014, Zbigniew Bodek <zbb@xxxxxxxxxxxx>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> 2014-09-01 11:56 GMT+02:00 Zbigniew Bodek <zbb@xxxxxxxxxxxx>:
>>>>>>>>>>
>>>>>>>>>> Hello Christopher,
>>>>>>>>>>
>>>>>>>>>> Please check out my answers in-line.
>>>>>>>>>>
>>>>>>>>>> Best regards
>>>>>>>>>> zbb
>>>>>>>>>>
>>>>>>>>>> 2014-09-01 11:00 GMT+02:00 Christoffer Dall
>>>>>>>>>> <christoffer.dall@xxxxxxxxxx>:
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Aug 26, 2014 at 12:17:17PM +0200, Zbigniew Bodek wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> 2014-08-25 21:30 GMT+02:00 Joel Schopp <joel.schopp@xxxxxxx>:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 08/25/2014 02:03 PM, Peter Maydell wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 25 August 2014 19:49, Joel Schopp <joel.schopp@xxxxxxx>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm guessing that the patches Peter posted recently might have a
>>>>>>>>>>>>>>> positive effect on breakpoints and debugging with gdb.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://lists.nongnu.org/archive/html/qemu-devel/2014-08/msg01291.html
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> No, those relate to running a GDB inside the QEMU TCG guest,
>>>>>>>>>>>>>> and won't affect an external gdb. What Zbigniew wants to do
>>>>>>>>>>>>>> is use the QEMU gdbstub to debug the whole guest. For
>>>>>>>>>>>>>> breakpoints to work in that when KVM is enabled requires
>>>>>>>>>>>>>> implementing support for breakpoints and the debug
>>>>>>>>>>>>>> interface in KVM and then the corresponding support for it
>>>>>>>>>>>>>> in QEMU. This is on our todo list (those with cards.linaro.org
>>>>>>>>>>>>>> access can find it as VIRT-116) but I don't think anybody's
>>>>>>>>>>>>>> working on it right now.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the clarification.  So many todos, so little time.
>>>>>>>>>>>>
>>>>>>>>>>>> Hello again,
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you all very much for your answers.
>>>>>>>>>>>> Actually I use qemu with GDB as described here (-s -S, etc.), I've
>>>>>>>>>>>> just posted the simplest command that I use to trigger the bug.
>>>>>>>>>>>> So regarding your answers I would like to ask for some more
>>>>>>>>>>>> clarification:
>>>>>>>>>>>> * Has anyone encountered/reproduce this problem on your ARM host?
>>>>>>>>>>>> * Any suggestion where to look in the KVM source for these kind of
>>>>>>>>>>>> problems?
>>>>>>>>>>>> * Maybe there is some unstable feature that I could disable in
>>>>>>>>>>>> host/guest to narrow this issue?
>>>>>>>>>>>>
>>>>>>>>>>> Just checking:  Are you compiling any of your kernels in Thumb-2 mode?
>>>>>>>>>>> I believe the 5250 has a hardware bug with Thumb-2 and Hyp mode.
>>>>>>>>>
>>>>>>>>> No, I am not compiling my kernels in Thumb-2 mode.
>>>>>>>>>
>>>>>>>>>>> Continuous reboots seems to work fine on my TC2.  I can try giving it a
>>>>>>>>>>> test on an Arndale some time in the near future.  Which kernel version
>>>>>>>>>>> (exact commit) are you using for your host and guest?
>>>>>>>>>
>>>>>>>>> I used v3.17-rc1 which is: 7d1311b93e58ed55f3a31cc8f94c4b8fe988a2b9
>>>>>>>>
>>>>>>>>
>>>>>>>> Ok, and when you say reboot, do you mean executing 'reboot' in the guest or
>>>>>>>> are you shutting down the qemu process completely and starting a new VM? Do
>>>>>>>> you run any specific workload in the VM?
>>>>>>>>
>>>>>>>>
>>>>>>> I mean to execute 'reboot' command in the guest system, not exit and
>>>>>>> reset qemu process.
>>>>>>> But the issues I encounter can be also triggered when I execute guest
>>>>>>> machine for the first time (so without previous reboots). It is
>>>>>>> totally random behavior.
>>>>>>> I am not running anything in the VM. Just login to busybox/Ubuntu and
>>>>>>> reboot.
>>>>>>>
>>>>>> I just had a loop running on TC2 for hours and it works fine.  Likely to
>>>>>> be some specific issue with the Arndale board, I'll have a look when I
>>>>>> have a chance to configure my board.
>>>>>
>>>>>
>>>>> I saw this behaviour on Arndale board also. When I saw it I was running an
>>>>> older kernel so it is not something new.
>>>>>
>>>>> Diana
>>>>>
>>>> Hello,
>>>>
>>>> Thanks for the confirmation. I've also tried older kernel and the same
>>>> problems appear.
>>>>
>>>> Does anyone have any clues which areas to investigate in order to debug
>>>> this?
>>>> Thank you all for your help.
>>>>
>>> You should enable the KVM dynamic tracepoints in your host and figure
>>> out where the guest gets stuck and if there's any consistency between
>>> when it happens.
>>>
>>
>> I remember that I enabled tracepoints but it was inconsistent, the problem
>> did not happened in the same place. Also I have noticed that when it
>> returned from guest the guest PC was wrong. It had strange values of 0xc,
>> 0xf...
>>
> Dumping a sequence of these errors in some text file, pastebin or
> something for me to look at would be a great help in starting to track
> down this issue.

I could reproduce the issue on my Arndale board. Interestingly, with a
3.17-rc1 guest I got the first reported issue: upon issuing the
"reboot" command in the guest, qemu exited and the
"load/store instruction decoding not implemented"
message came out. But when I used a 3.16 guest I got a hang, possibly
something like the second reported issue.

As requested, I captured several kvm traces as well as associated
materials, which can be found on my Linaro Google Drive [1].

Note that the tarball is quite big because I included vmlinux for the
guests so that the PCs in the traces can be decoded.

I was also able to reproduce the issue on my TC2 with v3.17-rc2; it
took a bit longer than on the Arndale. So I don't think the issue is
Arndale-specific.

Note that I can have DSTREAM/DS-5 attached to either the Arndale or
the TC2. If you want me to collect more information at the point where
the "load/store instruction decoding not implemented" message
is produced, I can do it.

For the "load/store instruction decoding not implemented" case,
here are examples of kvm_guest_fault events; the guest faulted PCs
do not make much sense to me:

[kamensky@kamensky-w530 kvm_reboot]$ for x in run_guest_3.17-rc1/*/*; do
    echo "=== $x ==="
    awk -n 'BEGIN { FS="[ ,]+"; ar=0 }
            /reason restart/ { ar=1 }
            /kvm_guest_fault/ { if (ar && !and($10, 0x1000000)) { print $0; } }' "$x"
done
=== run_guest_3.17-rc1/run1/kvm_trace.1.txt ===
 qemu-system-arm-1773  [000] ...1  1565.358903: kvm_guest_fault: ipa 0x0, hsr 0x90000006, hxfar 0x000004, pc 0x80227dd4
=== run_guest_3.17-rc1/run1/kvm_trace.txt ===
 qemu-system-arm-1740  [001] ...1  1187.653887: kvm_guest_fault: ipa 0x0, hsr 0x90000006, hxfar 0x000004, pc 0x80227dd4
=== run_guest_3.17-rc1/run2/kvm_trace.2.txt ===
 qemu-system-arm-1620  [000] ...1   108.794861: kvm_guest_fault: ipa 0x0, hsr 0x90000046, hxfar 0x000000, pc 0x80010358
=== run_guest_3.17-rc1/run2/kvm_trace.3.txt ===
 qemu-system-arm-1633  [001] ...1   449.329875: kvm_guest_fault: ipa 0x0, hsr 0x90000046, hxfar 0x000000, pc 0x80010358
=== run_guest_3.17-rc1/run2/kvm_trace.4.txt ===
 qemu-system-arm-1647  [001] ...1   707.694854: kvm_guest_fault: ipa 0x0, hsr 0x90000046, hxfar 0x000000, pc 0x80010358
=== run_guest_3.17-rc1/run3/kvm_trace.5.txt ===
 qemu-system-arm-1622  [001] ...1   206.914132: kvm_guest_fault: ipa 0xfffff000, hsr 0x90000045, hxfar 0xfffffff8, pc 0x8000fa78
=== run_guest_3.17-rc1/run3/kvm_trace.6.txt ===
 qemu-system-arm-1639  [000] ...1   556.609131: kvm_guest_fault: ipa 0xfffff000, hsr 0x90000045, hxfar 0xfffffff8, pc 0x8000fa78
[kamensky@kamensky-w530 kvm_reboot]$ arm-linux-gnueabihf-gdb guest_3.17-rc1/vmlinux
<snip>
Reading symbols from /home/kamensky/kvm_reboot/guest_3.17-rc1/vmlinux...done.
(gdb) x /1i 0x80227dd4
   0x80227dd4 <__put_user_4>:    adds    r12, r0, #3
(gdb) x /1i 0x80010358
   0x80010358 <arch_ptrace+1060>:    ldr    r2, [r0, #388]    ; 0x184
(gdb) x /1i 0x8000fa78
   0x8000fa78 <fpa_get>:    push    {r3, lr}
(gdb)
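
For anyone who wants to post-process the traces without gawk, the
filter above can be approximated in Python. This is only a sketch: it
assumes each kvm_guest_fault record sits on one unwrapped line, and the
bit positions (EC in bits 31:26, ISV in bit 24, WnR in bit 6, fault
status in bits 5:0) are taken from the ARMv7 HSR layout for data
aborts, not from the KVM sources. The names decode_hsr and parse_fault
are made up for this example.

```python
import re

# Matches the payload of a kvm_guest_fault trace record (one record per line).
FAULT_RE = re.compile(
    r"kvm_guest_fault: ipa (0x[0-9a-f]+), hsr (0x[0-9a-f]+), "
    r"hxfar (0x[0-9a-f]+), pc (0x[0-9a-f]+)"
)

def decode_hsr(hsr):
    """Split an HSR value into the fields of interest (assumed ARMv7 layout)."""
    return {
        "ec": (hsr >> 26) & 0x3F,       # exception class
        "isv": bool(hsr & (1 << 24)),   # instruction syndrome valid
        "wnr": bool(hsr & (1 << 6)),    # write-not-read
        "fsc": hsr & 0x3F,              # fault status code
    }

def parse_fault(line):
    """Return a dict for a kvm_guest_fault line, or None for other lines."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    ipa, hsr, hxfar, pc = (int(g, 16) for g in m.groups())
    return {"ipa": ipa, "hsr": hsr, "hxfar": hxfar, "pc": pc, **decode_hsr(hsr)}

line = (" qemu-system-arm-1773  [000] ...1  1565.358903: kvm_guest_fault: "
        "ipa 0x0, hsr 0x90000006, hxfar 0x000004, pc 0x80227dd4")
fault = parse_fault(line)
# Same condition as the awk filter: keep faults whose ISV bit is clear.
if fault is not None and not fault["isv"]:
    print(hex(fault["pc"]), hex(fault["ec"]), hex(fault["fsc"]))
```

Unlike the awk loop, this does not wait for the "reason restart" line
before it starts matching; that check would have to be added by the
caller.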

I also see other cases where hsr does not have the isv bit set,
but it seems those do not go to the io_mem_abort function because
for those faults 'kvm_is_visible_gfn(vcpu->kvm, gfn)' returns 1.

I also observed that the reported guest PC is the same for every
guest reboot within a given host boot, but different faulted guest PCs
are reported across different host reboots. I am not sure; maybe it is
a coincidence. Note that in my tarball each runN directory corresponds
to a single host boot. It may contain more than one trace.
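
One way to check the per-host-boot pattern mechanically is to group the
faulted PCs by runN directory. The sketch below is illustrative only:
pcs_by_run is a hypothetical helper, and the (path, pc) pairs are
copied by hand from the kvm_guest_fault lines quoted earlier in this
mail rather than read from the tarball.

```python
from collections import defaultdict
from pathlib import PurePosixPath

def pcs_by_run(faults):
    """Map each runN directory name to the set of faulted guest PCs seen in it."""
    grouped = defaultdict(set)
    for path, pc in faults:
        run_dir = PurePosixPath(path).parent.name  # e.g. "run1"
        grouped[run_dir].add(pc)
    return dict(grouped)

# (trace path, faulted guest PC) pairs taken from the output quoted above.
faults = [
    ("run_guest_3.17-rc1/run1/kvm_trace.1.txt", 0x80227dd4),
    ("run_guest_3.17-rc1/run1/kvm_trace.txt", 0x80227dd4),
    ("run_guest_3.17-rc1/run2/kvm_trace.2.txt", 0x80010358),
    ("run_guest_3.17-rc1/run2/kvm_trace.3.txt", 0x80010358),
    ("run_guest_3.17-rc1/run2/kvm_trace.4.txt", 0x80010358),
    ("run_guest_3.17-rc1/run3/kvm_trace.5.txt", 0x8000fa78),
    ("run_guest_3.17-rc1/run3/kvm_trace.6.txt", 0x8000fa78),
]

by_run = pcs_by_run(faults)
for run, pcs in sorted(by_run.items()):
    print(run, sorted(hex(pc) for pc in pcs))
```

A run directory that maps to more than one PC would contradict the
observation; with the seven traces listed here, each run maps to
exactly one.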

Also, in the run_guest_3.16 directory, which as I mentioned corresponds
to the hang-on-reboot situation, the PC values are similar to the one
Diana mentioned (0xc); the traces show kvm looping on those faults.

Thanks,
Victor

[1] https://drive.google.com/file/d/0B_699BvOl4RrOVVSWnhTaVdpS2c/edit?usp=sharing

> -Chritoffer
> _______________________________________________
> kvmarm mailing list
> kvmarm@xxxxxxxxxxxxxxxxxxxxx
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm