Re: Book3s_hv KVM HTAB bug

On 14.06.2013, at 01:20, Paul Mackerras wrote:

> On Thu, Jun 13, 2013 at 02:34:56PM +0200, Alexander Graf wrote:
>> Hi Paul,
>> 
>> We've just seen another KVM bug with 3.8 on p7. It looks as if for some reason a bolted HTAB entry for the kernel got evicted.
>> 
> ...
>> (gdb) x /i 0xc000000000005d00
>>   0xc000000000005d00 <instruction_access_common>:	andi.   r10,r12,16384
>> (qemu) xp /i 0x5d00
>>   0x0000000000005d00:  andi.   r10,r12,16384
>> (qemu) info tlb
>>   SLB    ESID                    VSID
>>   3      0xc000000008000000      0x0000c00838795000
>> 
>> So for some reason QEMU can still resolve the virtual address using the guest HTAB, but the CPU cannot. Otherwise the guest wouldn't get a 0x400 when accessing that page.
> 
> When I've seen this sort of thing it has usually been that we failed
> to insert a HPTE in htab_bolt_mapping(), called from
> htab_initialize().  When that happens we BUG_ON(), which is stupid
> because it causes a program interrupt, and the first thing we do is
> turn the MMU on, but we don't have a linear mapping set up, so we
> start taking continual instruction storage interrupts (because the ISI
> handler also wants to turn the MMU on).  Ben has an idea to fix

Ok, that makes sense and sounds like a plausible failure scenario. Unfortunately the guest has already been killed, and right now everything is running again with no guest hanging.
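
For reference, I assume you mean the call sites in htab_initialize() in arch/powerpc/mm/hash_utils_64.c, i.e. roughly this (quoting from memory, so the exact arguments may be off):

	BUG_ON(htab_bolt_mapping(base, base + size, __pa(base),
				 prot, mmu_linear_psize,
				 mmu_kernel_ssize));

If that fires before the linear mapping is bolted, the cascade you describe follows directly.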

However, I forgot to paste the dump of log_buf in my last email. Does that log match what you would expect at this point?

00000000  00 00 00 00 00 00 00 00  00 4c 00 39 00 00 00 37  |.........L.9...7|
00000010  41 6c 6c 6f 63 61 74 65  64 20 39 31 37 35 30 34  |Allocated 917504|
00000020  20 62 79 74 65 73 20 66  6f 72 20 31 30 32 34 20  | bytes for 1024 |
00000030  70 61 63 61 73 20 61 74  20 63 30 30 30 30 30 30  |pacas at c000000|
00000040  30 30 66 66 32 30 30 30  30 00 00 00 00 00 00 00  |00ff20000.......|
00000050  00 00 00 00 00 34 00 21  00 00 00 36 55 73 69 6e  |.....4.!...6Usin|
00000060  67 20 70 53 65 72 69 65  73 20 6d 61 63 68 69 6e  |g pSeries machin|
00000070  65 20 64 65 73 63 72 69  70 74 69 6f 6e 00 00 00  |e description...|
00000080  00 00 00 00 00 00 00 00  00 48 00 37 00 00 00 37  |.........H.7...7|
00000090  50 61 67 65 20 6f 72 64  65 72 73 3a 20 6c 69 6e  |Page orders: lin|
000000a0  65 61 72 20 6d 61 70 70  69 6e 67 20 3d 20 31 36  |ear mapping = 16|
000000b0  2c 20 76 69 72 74 75 61  6c 20 3d 20 31 36 2c 20  |, virtual = 16, |
000000c0  69 6f 20 3d 20 31 32 00  00 00 00 00 00 00 00 00  |io = 12.........|
000000d0  00 24 00 12 00 00 00 36  55 73 69 6e 67 20 31 54  |.$.....6Using 1T|
000000e0  42 20 73 65 67 6d 65 6e  74 73 00 00 00 00 00 00  |B segments......|
000000f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000186a0
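
If I decode it right, this parses cleanly as the structured printk ring buffer that 3.8 uses, where every record starts with a fixed header (roughly this, from kernel/printk.c, quoted from memory):

	struct log {
		u64 ts_nsec;		/* timestamp in nanoseconds */
		u16 len;		/* length of entire record, padded */
		u16 text_len;		/* length of the message text */
		u16 dict_len;		/* length of the dictionary */
		u8 facility;		/* syslog facility */
		u8 flags:5;		/* internal record flags */
		u8 level:3;		/* syslog level */
	};

So the first record above reads as ts_nsec=0, len=0x4c, text_len=0x39, i.e. the 57-byte "Allocated 917504 bytes for 1024 pacas ..." line, and the rest decodes the same way. What strikes me is that the log ends right after "Using 1TB segments", i.e. in the middle of hash MMU setup and before anything else gets logged, which would fit your htab_bolt_mapping() theory rather well.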

> that, which is to have IR and DR off in paca->kernel_msr until we're
> ready to turn the MMU on.  That might help debuggability in the case
> you're hitting, whether or not it's htab_bolt_mapping failing.
> 
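If I understand the idea correctly, it would amount to something like this (just my sketch of the concept, not Ben's actual patch):

	/* during early setup: exception exits must not turn the MMU on yet */
	get_paca()->kernel_msr = MSR_KERNEL & ~(MSR_IR | MSR_DR);

	htab_initialize();

	/* the linear mapping is bolted now; exceptions may run translated */
	get_paca()->kernel_msr = MSR_KERNEL;

That would at least leave us with a usable register state instead of an endless ISI loop.
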
> Are you *absolutely* sure that QEMU is using the guest HTAB to
> translate the 0xc... addresses?  If it is actually doing so it would

No, you're right. I got confused with PR KVM. Really, I'm surprised QEMU is able to resolve anything at all without access to the HTAB.

But it probably just saw that MSR.DR=0 and simply used the real-mode algorithm to read the data, which happened to work correctly, since the virtual address is a valid real-mode address as well.
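
That also lines up with the numbers: for the kernel linear mapping, the real address is just the effective address with the 0xc... offset stripped, along the lines of what __pa() does in the kernel (a sketch, not the actual QEMU code):

	/* linear-map EA -> real address: drop the PAGE_OFFSET bits */
	uint64_t ra = ea & ~0xc000000000000000ULL;
	/* 0xc000000000005d00 -> 0x5d00, matching the xp output above */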

Sorry for the incorrect assumption.

> need to be using the relatively new KVM_PPC_GET_HTAB_FD ioctl, and I
> thought the only place that was used was in the migration code.
> 
> To debug this sort of thing, what I usually do is patch the guest
> kernel to put a branch to self at 0x400.  Then when it hangs you have
> some chance of sorting out what happened using info registers etc.

Now if only it would happen a bit more often ;).

> I would be very interested to know how big a HPT the host kernel
> allocated for the guest and what was in it.  The host kernel prints a
> message telling you the size and location of the HPT, and in this sort

Yes. Unfortunately it doesn't tell me the PID, though, so I have a hard time correlating the dmesg output with a particular VM. However, I'm pretty sure it's this one:

Jun 13 06:31:16 build65 kernel: KVM guest htab at c00000012ae00000 (order 19), LPID 4

That's a 512KB HPT, right? Sounds too small to me :).
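
Doing the math on it (an HPTE is 16 bytes):

	2^19 bytes / 16 bytes per HPTE = 32768 HPTEs
	32768 HPTEs * 64KB per page    = 2GB mapped, at 100% utilization

and a hash table never reaches 100% utilization, so for a 3GB guest that HPT is hopelessly undersized even before collisions enter the picture.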

> of situation I find it helpful to take a copy of it with dd and dump
> it with hexdump.

Too late this time around. I'll try to do it next time I see this happening :).
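
(For the record, given the HPT at real address 0x12ae00000, that should be something like

	dd if=/dev/mem of=hpt.bin bs=512k skip=9564 count=1

assuming /dev/mem access isn't blocked by CONFIG_STRICT_DEVMEM.)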

> 
> Also, what page size are you using in the host kernel?  If it's 4k,
> then the guest kernel is limited to using 4k pages for the linear
> mapping, which can mean it runs out of space in the HPT for the linear
> mapping more easily.

In this case the host is running on 64k pages.
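
The difference matters quite a bit here. Counting only the bolted linear mapping of a 3GB guest, at 16 bytes per HPTE:

	64k pages: 3GB / 64KB = 49152 HPTEs  =  768KB of HPT
	4k pages:  3GB /  4KB = 786432 HPTEs =   12MB of HPT

Note that even the 64k case already exceeds the 512KB HPT from above, so (assuming the guest really has 3GB) the bolted linear mapping alone could never have fit.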

> Since you don't have my patch to add a flexible allocator for the HPT
> and RMA areas (you rejected it, if you recall), you'll be limited to
> what you can allocate from the page allocator, which is usually 16MB,
> but may be less if free memory is low and/or fragmented.  16MB should
> be enough for a 3GB guest, particularly if you're using 64k pages in
> the host, but if the host was only able to allocate a much smaller
> HPT, that might explain the problem.

Yeah, it sounds like this really is the problem. What I was asking you for back then was to take a look at the dynamic page reshuffling mechanisms that were introduced with CMA and transparent huge pages.

I don't think that preallocating yet another potentially fragmented pool of bigger memory chunks - which is what your patch did - is the answer to this problem. We just need to defragment normal system memory and delay HPT creation until a contiguous chunk is available. It can't be that hard.

Andrea, do you happen to have any hints for us here? We need roughly 16MB of physically contiguous memory available whenever we create a virtual machine on POWER. Can we somehow tell the mm system to shuffle pages around and free up a chunk that big for us?
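
Something along the lines of the existing CMA interface seems like the right shape to me: reserve a migratable region at boot and only carve the HPT out of it when a VM is actually created. Very rough sketch (kvm_cma_dev is made up, and this assumes a CMA area was attached to it at boot, e.g. via dev_set_cma_area()):

	#include <linux/device.h>
	#include <linux/dma-contiguous.h>
	#include <linux/mm.h>

	#define HPT_ORDER	24	/* log2(16MB) */

	/* hypothetical owner of a boot-time reserved CMA area */
	static struct device kvm_cma_dev;

	static unsigned long kvmppc_alloc_hpt_cma(void)
	{
		struct page *page;

		/*
		 * count is in pages, align is a page order.  CMA migrates
		 * any movable pages out of the chosen range instead of
		 * failing outright, which is exactly the defragmentation
		 * behaviour we're after.
		 */
		page = dma_alloc_from_contiguous(&kvm_cma_dev,
						 1 << (HPT_ORDER - PAGE_SHIFT),
						 HPT_ORDER - PAGE_SHIFT);
		if (!page)
			return 0;

		return (unsigned long)page_address(page);
	}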


Alex
