Re: KVM/ARM status and branches

On Mon, Sep 10, 2012 at 4:59 PM, Alexander Graf <agraf@xxxxxxx> wrote:
>
>
> On 10.09.2012, at 22:07, Christoffer Dall <c.dall@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
>> On Mon, Sep 10, 2012 at 4:04 PM, Marc Zyngier <marc.zyngier@xxxxxxx> wrote:
>>> On Mon, 10 Sep 2012 10:32:04 -0400, Christoffer Dall
>>> <c.dall@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>> On Mon, Sep 10, 2012 at 6:18 AM, Marc Zyngier <marc.zyngier@xxxxxxx> wrote:
>>>>> On 10/09/12 05:04, Christoffer Dall wrote:
>>>>>> Hello,
>>>>>>
>>>>>> We have a new branch, which will never be rebased and should always be
>>>>>> bisectable and mergeable. It's kvm-arm-master and can be found here:
>>>>>>
>>>>>> git://github.com/virtualopensystems/linux-kvm-arm.git kvm-arm-master
>>>>>>
>>>>>> (or pointy-clicky web interface:)
>>>>>> https://github.com/virtualopensystems/linux-kvm-arm
>>>>>>
>>>>>> This branch merges 3.6-rc5.
>>>>>>
>>>>>> The branch also merges all of Marc Zyngier's timer, vgic and hyp-mode
>>>>>> boot branches.
>>>>>>
>>>>>> It is also merged with the IRQ injection API changes (which touched
>>>>>> KVM_IRQ_LINE), as there haven't been any other comments on them. This
>>>>>> requires qemu patches, which can be found here:
>>>>>>
>>>>>> git://github.com/virtualopensystems/qemu.git kvm-arm-irq-api
>>>>>>
>>>>>> (or pointy-clicky web interface:)
>>>>>> https://github.com/virtualopensystems/qemu
>>>>>>
>>>>>> Two things are outstanding on my end before I attempt an initial
>>>>>> upstream submission:
>>>>>> 1. We have a bug when we start swapping in the host: the guest kernel
>>>>>> dies with "BUG: Bad page state..." and all sorts of bad things follow.
>>>>>> If we really put the host under memory pressure, it seems the host can
>>>>>> also crash, or at least become completely unresponsive. The same test
>>>>>> on a KVM kernel without any VMs does not cause this BUG.
>>>>>
>>>>> Is that the one you're seeing?
>>>>>
>>>>> [  312.189234] ------------[ cut here ]------------
>>>>> [  312.203056] kernel BUG at arch/arm/kvm/mmu.c:382!
>>>>> [  312.217134] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP THUMB2
>>>>> [  312.235376] Modules linked in:
>>>>> [  312.244515] CPU: 0    Not tainted  (3.6.0-rc3+ #40)
>>>>> [  312.259118] PC is at stage2_clear_pte+0x128/0x134
>>>>> [  312.273193] LR is at kvm_unmap_hva+0x97/0xa0
>>>>> [  312.285967] pc : [<c001e10c>]    lr : [<c001ee0f>]    psr: 60000133
>>>>> [  312.285967] sp : caa25998  ip : df97a028  fp : 00800000
>>>>> [  312.320355] r10: 873b5b5f  r9 : c8654000  r8 : 01c55000
>>>>> [  312.335990] r7 : 00000000  r6 : df249c00  r5 : c688fb80  r4 : df249ccc
>>>>> [  312.355532] r3 : 00000000  r2 : 2e001000  r1 : 00000000  r0 : 00000000
>>>>> [  312.375076] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA Thumb  Segment user
>>>>> [  312.396962] Control: 70c5387d  Table: 8a9bbb00  DAC: fffffffd
>>>>> [  312.414161] Process hackbench (pid: 7207, stack limit = 0xcaa242f8)
>>>>>
>>>>
>>>> FYI, this is what I'm seeing in the guest in more detail (this
>>>> couldn't be the icache stuff, could it?):
>>>
>>> [...]
>>>
>>> I do see similar things - and some others. It is really random.
>>>
>>> I tried nuking the icache without any success. I spent the whole day
>>> adding flushes on every code path, without making a real difference. And
>>> the more I think of it, the more I'm convinced that this is caused by the
>>> way we manipulate pages without telling the kernel what we're actually
>>> doing.
>>>
>>> What happens is that as far as the kernel is concerned, the qemu pages are
>>> always clean. We never flag a page dirty, because it is the guest that
>>> performs the write, and we're completely oblivious to that path. What I
>>> think happens is that the guest writes some data to the cache (or even to
>>> memory) and the underlying page gets evicted without being synced first,
>>> because nobody knows it's been modified.
>>>
>>> If my gut feeling is right, we need to tell the kernel that as soon as a
>>> page is inserted in stage-2, it is assumed to be dirty. We could always
>>> mark them read-only and resolve the fault at a later time, but that isn't
>>> important at the moment. And we need to flag it in the qemu mapping,
>>> because it is the one being evicted.
>>>
>>> What do you think?
>>>
>> I think this is definitely a good bet. I remember Alex Graf saying
>> something about KVM taking care of the dirty bit for us, but I'm not
>> sure.
>
> There is a kvm helper function to mark a gfn dirty. You need to call that one :).
>
>>
>> We already mark pages read-only if that makes sense, so we could avoid
>> setting the dirty bit there.
>
> You may want to use the dirty bit for VGA dirty bitmap information, so yes, doing it on demand makes sense.
>
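
For anyone following along, here is a minimal sketch of what that
suggestion could look like in the stage-2 fault path. Only
kvm_set_pfn_dirty() and mark_page_dirty() are existing KVM helpers;
the surrounding function, its arguments and where it gets called from
are made up for illustration, and whether both calls are wanted is
exactly the kind of detail the real patch has to settle.

#include <linux/kvm_host.h>

/*
 * Hypothetical call site: run once a writable stage-2 mapping for
 * guest frame 'gfn', backed by host page frame 'pfn', has been
 * installed.
 */
static void stage2_flag_page_dirty(struct kvm *kvm, gfn_t gfn, pfn_t pfn)
{
        /*
         * The guest can now write to this page without the host ever
         * seeing a fault, so flag the backing struct page dirty up
         * front; otherwise reclaim may evict or swap the page out
         * without writing the guest's data back.
         */
        kvm_set_pfn_dirty(pfn);

        /*
         * Also log the write in the memslot dirty bitmap, which is
         * what userspace consumers (e.g. VGA framebuffer tracking)
         * look at.
         */
        mark_page_dirty(kvm, gfn);
}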


Thanks. It turns out that *both* fixes were needed: we were in fact
seeing the infamous icache bug, and of course the dcache bug was going
to blow things up as well.
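
For completeness, the icache half is conceptually something like the
sketch below: before the guest gets to execute from a freshly mapped
page, clean the dcache and invalidate the icache for it. This is only
an illustration assuming a non-aliasing (PIPT) icache; the helper name
and the kmap-based flush are mine, not the actual patch, which is in
the separate e-mail mentioned below.

#include <linux/kvm_host.h>
#include <linux/highmem.h>
#include <asm/cacheflush.h>

/*
 * Hypothetical helper: make the contents of the page backing 'pfn'
 * visible to instruction fetches before the guest runs code from it.
 * Assumes a PIPT icache; an aliasing VIPT icache would need a full
 * icache invalidate instead of a by-range flush.
 */
static void guest_icache_coherent_page(pfn_t pfn)
{
        unsigned long va = (unsigned long)kmap_atomic(pfn_to_page(pfn));

        /*
         * Clean the dcache to the point of unification and invalidate
         * the icache for this page, so the guest doesn't execute stale
         * instructions left over from the page's previous user.
         */
        flush_icache_range(va, va + PAGE_SIZE);

        kunmap_atomic((void *)va);
}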

Tested with three VMs, each using 400MB (/dev/random > <ramfs>/foo),
all running cyclictest and hackbench while I put the host under memory
pressure. I saw a lot of swapping and everything kept working. I then
ran KSM on top of that; it didn't amount to all that much, and things
were still working. Next I replaced all the /dev/random data with
/dev/zero, ran KSM again, and saw all the memory being swapped back in
and freed. Finally I wrote a few pages into the ramfs files in two
guests to break COW, and everything was still running beautifully
after more than 45 minutes. I'm happy at this point; see the patch in
a separate e-mail.

-Christoffer
_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm

