On Mon, Mar 28, 2011 at 08:02:47PM +0200, Avi Kivity wrote:
> On 03/28/2011 07:54 PM, Andrea Arcangeli wrote:
> > BTW, is it genuine that a protection fault is generated instead of a
> > page fault while dereferencing address 0x00008805d6b087f8? I would
> > normally expect a page fault from a memory dereference that doesn't
> > alter processor state/segments.
>
> Yes. Bits 48-63 of the address must be equal to bit 47, or a #GP is
> generated (non-canonical address).

Ok, when you said 16 bit reversed I didn't connect it to bit 48 and the
max 128TB of user address space. I thought it was a good idea to check
because in the past I've seen GPFs that were hardware issues triggering
on a normal memory dereference, but this is probably not the case here.
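For reference, the canonical-address rule Avi describes can be checked
like this (an illustrative standalone sketch only, not code from the
kernel; is_canonical() is a made-up helper and it assumes 48-bit
virtual addresses):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * With 48-bit virtual addresses, bits 48-63 must be a sign extension
 * of bit 47; dereferencing a non-canonical address raises #GP instead
 * of #PF.
 */
static bool is_canonical(uint64_t addr)
{
	uint64_t top = addr >> 47;		/* bits 47-63 */
	return top == 0 || top == 0x1ffff;	/* all-zero or all-one */
}

int main(void)
{
	uint64_t bad = 0x00008805d6b087f8ULL;	/* address from the oops */
	uint64_t good = 0x00007f0000000000ULL;

	printf("%#llx canonical? %d\n",
	       (unsigned long long)bad, is_canonical(bad));
	printf("%#llx canonical? %d\n",
	       (unsigned long long)good, is_canonical(good));
	return 0;
}

The oops address has bit 47 set but bits 48-63 clear, so the check
fails and the CPU reports #GP rather than walking the page tables.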
Tomasz, how easily can you reproduce? Could you upload to the site the
output of "objdump -dr arch/x86/kvm/mmu.o" too? (my assembly is vastly
different from the one shown so far; I may find more info in the oops
if I get the assembly of the caller too, and of the iteration of the
loop that runs in that function before the GPF)

khugepaged is present in your second trace (and khugepaged is mangling
over some memslot range with guest gfns mapped, or kvm_unmap_rmapp
wouldn't be called in the first place; hope the memslots are all ok),
but you probably didn't get the right alignment, so the THPs are likely
mapped as 4k pages in the guest, which must work fine too. I wonder if
that might be related (my qemu-kvm I keep patched with the patch below,
which isn't yet polished enough to be digestible for qemu: wrong
alignments, x86 4M alignment not handled yet, and I'm not sure if the
DONTFORK fix to prevent OOM with hotplug/migrate is acceptable in that
position).

Can you try to "echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs"
and then run "cat /proc/`pgrep qemu`/smaps >/dev/null" once per minute
(or find the right pid by hand if you've got more than one qemu process
running)? This debug trick will only work for 2.6.38.1, as 2.6.39 has
native THP handling in the smaps file, but in 2.6.38.1 it should flush
all sptes mapped on THP just like fork does (this might help to
reproduce).

I'm also surprised this happened during the fork that initializes the
tap interface; shouldn't that fork run before any sptes are
established? (we're running the spte invalidate with the mmu notifier
in the parent before wrprotecting the ptes during fork) I also wonder
if it's a memslot race of some kind, but I don't see anything wrong in
the rmapp handling at the moment.

This isn't a patch to try; I'm only showing it here for reference, as I
suspect it might hide the bug. I'm now going to reverse it and see if I
can reproduce, in case having large sptes (instead of 4k sptes) always
mapped on host THP changes something.

Thanks!

diff --git a/exec.c b/exec.c
index bb0c1be..f60e5fe 100644
--- a/exec.c
+++ b/exec.c
@@ -2856,6 +2856,18 @@ static ram_addr_t last_ram_offset(void)
     return last;
 }
 
+#if defined(__linux__) && defined(__x86_64__)
+/*
+ * Align on the max transparent hugepage size so that
+ * "(gfn ^ pfn) & (HPAGE_SIZE-1) == 0" to allow KVM to
+ * take advantage of hugepages with NPT/EPT or to
+ * ensure the first 2M of the guest physical ram will
+ * be mapped by the same hugetlb for QEMU (it is worth
+ * it even without NPT/EPT).
+ */
+#define PREFERRED_RAM_ALIGN (2*1024*1024)
+#endif
+
 ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
                                    ram_addr_t size, void *host)
 {
@@ -2902,9 +2914,15 @@ ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
                                    PROT_EXEC|PROT_READ|PROT_WRITE,
                                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
 #else
-            new_block->host = qemu_vmalloc(size);
+#ifdef PREFERRED_RAM_ALIGN
+            if (size >= PREFERRED_RAM_ALIGN)
+                new_block->host = qemu_memalign(PREFERRED_RAM_ALIGN, size);
+            else
+#endif
+                new_block->host = qemu_vmalloc(size);
 #endif
             qemu_madvise(new_block->host, size, QEMU_MADV_MERGEABLE);
+            qemu_madvise(new_block->host, size, QEMU_MADV_DONTFORK);
         }
     }
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html