With current 3.11 kernels we got reports of nested qemu failing in weird
ways. I believe 3.10 also had issues before, but I am not sure whether
those were the same. With 3.8 based kernels (close to current stable) I
found no such issues.

It is possible to reproduce things with the following setup:

Host:
  64bit user-space, kernel 3.8 or 3.11 based, 64bit hypervisor,
  Haswell CPU (masked to core2duo for the 1st level), 4-core, 8G memory

1st level:
  32 or 64bit user-space, kernel 3.8 or 3.11 based, virtio net and
  block device, swap, 64bit hypervisor, 2-vcpu, 2G memory

2nd level:
  32bit user-space, kernel 3.8 based, user network stack (virtio),
  virtio block device, 2-vcpu, 1G memory

The test is basically to start the 2nd level guest from a base raw image
file and, through ssh, perform some package updates and install some new
packages inside it.

With a 3.8 kernel running on the host, the host logs some attempted (and
likely ignored) MSR accesses (caused by masking the vcpu to core2duo),
but the install in the 2nd level succeeds, except when the 1st level
runs a 32bit user-space. In that case I could observe the 1st level qemu
process using a lot of cpu time while the 2nd level showed signs of a
soft lockup. Maybe at least one of the 2nd level vcpus is not getting
scheduled anymore?

After switching the host kernel to 3.11 (about -rc4), the 2nd level
install fails with various symptoms:

- With a 32bit user-space in the 1st level it looks like the previously
  described lockup, though I observed NMI reason 21 and 31 messages as
  well.
- With a 3.8 kernel and 64bit user-space in the 1st level there were the
  NMI messages again, but also a double fault crash in the 2nd level
  which seemed to have the cmos_interrupt function at the start of the
  stack.
- With a 3.11-rc4 kernel and 64bit user-space in the 1st level there was
  only the double fault, without the NMI messages.

The symptoms could vary, but the ones described above were the most
likely with a given combination.
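A rough sketch of how the 1st and 2nd level guests described above could be launched, written as a small Python helper that builds the qemu command lines. This is not the exact invocation behind this report. The image paths are hypothetical, the report does not say how the core2duo masking was done (the "+vmx" suffix is an assumption so that the 1st level can itself run KVM), and user-mode networking for the 1st level is also an assumption.

```python
import shlex
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GuestSpec:
    """One nesting level of the reproduction setup."""
    image: str                # hypothetical path to a raw disk image
    vcpus: int
    mem_mb: int
    cpu_model: Optional[str]  # None: leave qemu's default vcpu model in place


def qemu_argv(spec: GuestSpec) -> List[str]:
    """Build a qemu command line with virtio block and virtio net (user stack)."""
    argv = [
        "qemu-system-x86_64",
        "-enable-kvm",
        "-smp", str(spec.vcpus),
        "-m", str(spec.mem_mb),
        "-drive", f"file={spec.image},format=raw,if=virtio",
        "-netdev", "user,id=net0",
        "-device", "virtio-net-pci,netdev=net0",
    ]
    if spec.cpu_model is not None:
        argv += ["-cpu", spec.cpu_model]
    return argv


# 1st level: started on the host, vcpu masked to core2duo.
# "+vmx" is an assumption; the report does not state how masking was done.
LEVEL1 = GuestSpec(image="level1.img", vcpus=2, mem_mb=2048,
                   cpu_model="core2duo,+vmx")

# 2nd level: started inside the 1st level guest, default vcpu model.
LEVEL2 = GuestSpec(image="level2.img", vcpus=2, mem_mb=1024, cpu_model=None)


if __name__ == "__main__":
    for name, spec in (("1st level", LEVEL1), ("2nd level", LEVEL2)):
        print(f"# {name}")
        print(shlex.join(qemu_argv(spec)))
```

Setting cpu_model to None for LEVEL1 corresponds to running without any cpu masking.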
I also tried 3.11 with 64bit user-space on the host and the 1st level
while not doing any cpu masking (so the 1st level sees a qemu 64bit
vcpu). This got rid of the MSR messages, but otherwise the 2nd level
would get stuck without any messages, sometimes with the 1st level busy
and sometimes not. Except for very rare cases where things really went
bad (and in one case took down the host), the 2nd level guest can still
be killed by ctrl-c from the 1st level guest.

Now I am not sure how best to debug this further. Has anybody seen
similar things, or can anybody help me with some advice on how to get
more information?

Thanks,
Stefan

Please cc me on replies as I am not subscribed.