On Mon, Mar 28, 2011 at 08:02:47PM +0200, Avi Kivity wrote:
> On 03/28/2011 07:54 PM, Andrea Arcangeli wrote:
> > BTW, is it genuine that a protection fault is generated instead of a
> > page fault while dereferencing address 0x00008805d6b087f8? I would
> > normally expect a page fault from a memory dereference that doesn't
> > alter processor state/segments.
>
> Yes. Bits 48-63 of the address must be equal to bit 47, or a #GP is
> generated (non-canonical address).

Ok, when you said 16 bit reversed I didn't connect it to bit 48 and the
max 128TB of user address space. I thought it was a good idea to check
because in the past I've seen GPFs that were hardware issues triggering
on a normal memory dereference, but this is probably not the case here.
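For reference, the canonical-address rule Avi describes can be checked
like this (an illustrative standalone sketch only, not code from the
kernel; is_canonical() is a made-up helper and it assumes 48-bit
virtual addresses):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * With 48-bit virtual addresses, bits 48-63 must be a sign extension
 * of bit 47; dereferencing a non-canonical address raises #GP instead
 * of #PF.
 */
static bool is_canonical(uint64_t addr)
{
	uint64_t top = addr >> 47;		/* bits 47-63 */
	return top == 0 || top == 0x1ffff;	/* all-zero or all-one */
}

int main(void)
{
	uint64_t bad = 0x00008805d6b087f8ULL;	/* address from the oops */
	uint64_t good = 0x00007f0000000000ULL;

	printf("%#llx canonical? %d\n",
	       (unsigned long long)bad, is_canonical(bad));
	printf("%#llx canonical? %d\n",
	       (unsigned long long)good, is_canonical(good));
	return 0;
}

The oops address has bit 47 set but bits 48-63 clear, so the check
fails and the CPU reports #GP rather than walking the page tables.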
Tomasz, how easily can you reproduce? Could you upload to the site the
output of "objdump -dr arch/x86/kvm/mmu.o" too? (my assembly is vastly
different from the one shown so far; I may find more info in the oops
if I get the assembly of the caller too, and of the iteration of the
loop that runs in that function before the GPF)

khugepaged is present in your second trace (and khugepaged is mangling
over some memslot range with guest gfns mapped, or kvm_unmap_rmapp
wouldn't be called in the first place; hope the memslots are all ok),
but you probably didn't get the right alignment, so the THPs are likely
mapped as 4k pages in the guest, which must work fine too. I wonder if
that might be related (my qemu-kvm I keep patched with the patch below,
which isn't yet polished enough to be digestible for qemu: wrong
alignments, x86 4M alignment not handled yet, and I'm not sure if the
DONTFORK fix to prevent OOM with hotplug/migrate is acceptable in that
position).

Can you try to "echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs"
and then run "cat /proc/`pgrep qemu`/smaps >/dev/null" once per minute
(or find the right pid by hand if you've got more than one qemu process
running)? This debug trick will only work for 2.6.38.1, as 2.6.39 has
native THP handling in the smaps file, but in 2.6.38.1 it should flush
all sptes mapped on THP just like fork does (this might help to
reproduce).

I'm also surprised this happened during the fork that initializes the
tap interface; shouldn't that fork run before any sptes are
established? (we're running the spte invalidate with the mmu notifier
in the parent before wrprotecting the ptes during fork) I also wonder
if it's a memslot race of some kind, but I don't see anything wrong in
the rmapp handling at the moment.

This isn't a patch to try; I'm only showing it here for reference, as I
suspect it might hide the bug. I'm now going to reverse it and see if I
can reproduce, in case having large sptes (instead of 4k sptes) always
mapped on host THP changes something.

Thanks!

diff --git a/exec.c b/exec.c
index bb0c1be..f60e5fe 100644
--- a/exec.c
+++ b/exec.c
@@ -2856,6 +2856,18 @@ static ram_addr_t last_ram_offset(void)
     return last;
 }
 
+#if defined(__linux__) && defined(__x86_64__)
+/*
+ * Align on the max transparent hugepage size so that
+ * "(gfn ^ pfn) & (HPAGE_SIZE-1) == 0" to allow KVM to
+ * take advantage of hugepages with NPT/EPT or to
+ * ensure the first 2M of the guest physical ram will
+ * be mapped by the same hugetlb for QEMU (it is worth
+ * it even without NPT/EPT).
+ */
+#define PREFERRED_RAM_ALIGN (2*1024*1024)
+#endif
+
 ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
                                    ram_addr_t size, void *host)
 {
@@ -2902,9 +2914,15 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
                                    PROT_EXEC|PROT_READ|PROT_WRITE,
                                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
 #else
-            new_block->host = qemu_vmalloc(size);
+#ifdef PREFERRED_RAM_ALIGN
+            if (size >= PREFERRED_RAM_ALIGN)
+                new_block->host = qemu_memalign(PREFERRED_RAM_ALIGN, size);
+            else
+#endif
+                new_block->host = qemu_vmalloc(size);
 #endif
             qemu_madvise(new_block->host, size, QEMU_MADV_MERGEABLE);
+            qemu_madvise(new_block->host, size, QEMU_MADV_DONTFORK);
         }
     }
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html