On Fri, Jun 24, 2016 at 03:10:03PM +0200, Paolo Bonzini wrote:
> On 24/06/2016 15:04, Quentin Casasnovas wrote:
> > On Thu, Jun 23, 2016 at 06:03:01PM +0200, Paolo Bonzini wrote:
> >> On 18/06/2016 11:01, Quentin Casasnovas wrote:
> >>> Cross-checking the KVM/VMX VMREAD emulation code with the Intel Software
> >>> Developper Manual Volume 3C - "VMREAD - Read Field from Virtual-Machine
> >>> Control Structure", I found that we're enforcing that the destination
> >>> operand is NOT located in a read-only data segment or any code segment when
> >>> the L1 is in long mode - BUT that check should only happen when it is in
> >>> protected mode.
> >>>
> >>> Shuffling the code a bit to make our emulation follow the specification
> >>> allows me to boot a Xen dom0 in a nested KVM and start HVM L2 guests
> >>> without problems.
> >>
> >> That's great, and I'm applying the patch, but it's also pretty weird. :)
> >> Do you have a pointer to Xen source code that does a VMREAD into a
> >> read-only data segment or a code segment?
> >
> > It is indeed pretty weird. Looking at the Xen stack trace, it looks like
> > the vmread is writing to an on-stack buffer, and surely it must be writable
> > so I wonder if Xen might not be using an executable stack for some reason?
> > That would be a bit scary so I'm surely missing something.
> >
> > Is there an easy way to know from my KVM host the different segment
> > permission setup by the guest?
>
> Remove your patch, call dump_vmcs() where the #GP is injected, and
> you'll find the VMCS (including segment permissions, but not the
> instruction info field---you probably should add it) in dmesg.
>

Thanks for the heads up :)

I've had a bit more time to spend on this this morning and attached is the
VMCS dump.
I've looked at the vmcs_instruction_info and it appears the segment
referenced is SS (which is in sync with the backtrace, where the
instruction causing the vmexit is "vmread %rbp, %rbp"), and it has awkward
attributes:

  SS: sel=0x0000, attr=0x1c000, limit=0xffffffff, base=0x0000000000000000

The low bits of attr, including the 4-bit type field, are all zero, so the
KVM VMX emulation was injecting the #GP(0) because it concluded we were
about to write to a read-only segment. At least the stack isn't executable
from what I can tell!

Attached is the full VMCS dump, where I've added a printk() to show the
'type' (all zeroes) and vmcs_instruction_info in case my above analysis is
complete nonsense.

Quentin

[ 9853.506447] kvm: wr: read-only segment type==0, info=e2614920
[ 9853.506464] *** Guest State ***
[ 9853.506466] CR0: actual=0x0000000080050033, shadow=0x0000000080050033, gh_mask=fffffffffffffff7
[ 9853.506467] CR4: actual=0x00000000001526e0, shadow=0x00000000001526e0, gh_mask=fffffffffffff871
[ 9853.506467] CR3 = 0x000000007aa37000
[ 9853.506468] RSP = 0xffff83007b73fab0 RIP = 0xffff82d0801e629e
[ 9853.506469] RFLAGS=0x00000202 DR7 = 0x0000000000000400
[ 9853.506470] Sysenter RSP=ffff83007b73ffc0 CS:RIP=e008:ffff82d08022c480
[ 9853.506471] CS: sel=0xe008, attr=0x0a09b, limit=0xffffffff, base=0x0000000000000000
[ 9853.506472] DS: sel=0x0000, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
[ 9853.506473] SS: sel=0x0000, attr=0x1c000, limit=0xffffffff, base=0x0000000000000000
[ 9853.506474] ES: sel=0x0000, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
[ 9853.506475] FS: sel=0x0000, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
[ 9853.506476] GS: sel=0x0000, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
[ 9853.506477] GDTR: limit=0x0000efff, base=0xffff83007b4d7000
[ 9853.506478] LDTR: sel=0x0000, attr=0x1c000, limit=0xffffffff, base=0x0000000000000000
[ 9853.506479] IDTR: limit=0x00000fff, base=0xffff83007b4e3000
[ 9853.506480] TR: sel=0xe040, attr=0x0008b, limit=0x00000067, base=0xffff83007b4e6c80
[ 9853.506481] EFER = 0x0000000000000d00 PAT = 0x0000050100070406
[ 9853.506481] DebugCtl = 0x0000000000000000 DebugExceptions = 0x0000000000000000
[ 9853.506482] Interruptibility = 00000000 ActivityState = 00000000
[ 9853.506483] *** Host State ***
[ 9853.506484] RIP = 0xffffffffa00f6daf RSP = 0xffff880131aafd00
[ 9853.506485] CS=0010 SS=0018 DS=0000 ES=0000 FS=0000 GS=0000 TR=0040
[ 9853.506486] FSBase=00007fbf6bfff700 GSBase=ffff88021e240000 TRBase=ffff88021e253b40
[ 9853.506486] GDTBase=ffff88021e249000 IDTBase=ffffffffff57b000
[ 9853.506487] CR0=0000000080050033 CR3=0000000004b21000 CR4=00000000001426e0
[ 9853.506488] Sysenter RSP=0000000000000000 CS:RIP=0010:ffffffff81a02740
[ 9853.506489] EFER = 0x0000000000000d01 PAT = 0x0407010600070106
[ 9853.506490] *** Control State ***
[ 9853.506491] PinBased=0000003f CPUBased=b6a06dfa SecondaryExec=000000eb
[ 9853.506491] EntryControls=0000d3ff ExitControls=002fefff
[ 9853.506492] ExceptionBitmap=00060042 PFECmask=00000000 PFECmatch=00000000
[ 9853.506493] VMEntry: intr_info=000000fc errcode=00000000 ilen=00000000
[ 9853.506494] VMExit: intr_info=00000000 errcode=00000000 ilen=00000006
[ 9853.506495] reason=00000017 qualification=0000000000000008
[ 9853.506495] IDTVectoring: info=00000000 errcode=00000000
[ 9853.506496] TSC Offset = 0xffffe8cdfc3ca592
[ 9853.506497] TPR Threshold = 0x00
[ 9853.506497] EPT pointer = 0x000000000467f01e
[ 9853.506498] Virtual processor ID = 0x0007
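
For anyone who wants to cross-check the two values discussed above (the SS
attr and the instruction info from the first line of the dump), here is a
rough decoding sketch. It is not KVM code: the function names are
hypothetical, and the bit positions are assumptions taken from the SDM's
description of the VMCS segment access-rights format and of the VM-exit
instruction-information field for VMREAD/VMWRITE. The read-only/code test is
a simplified restatement of the check described in this thread.

```python
# Sketch only. Function names are hypothetical (not KVM symbols); bit
# positions are assumed from the SDM access-rights and VMREAD/VMWRITE
# instruction-information layouts.

SEG_NAMES = ["ES", "CS", "SS", "DS", "FS", "GS"]
ADDR_SIZES = {0: 16, 1: 32, 2: 64}


def decode_access_rights(attr):
    """Split a VMCS segment access-rights value into its fields."""
    return {
        "type": attr & 0xF,                # bits 3:0, descriptor type
        "s": (attr >> 4) & 1,              # bit 4, 1 = code/data segment
        "dpl": (attr >> 5) & 3,            # bits 6:5
        "present": (attr >> 7) & 1,        # bit 7
        "avl": (attr >> 12) & 1,           # bit 12
        "long": (attr >> 13) & 1,          # bit 13, L
        "db": (attr >> 14) & 1,            # bit 14, D/B
        "granularity": (attr >> 15) & 1,   # bit 15, G
        "unusable": (attr >> 16) & 1,      # bit 16, segment unusable
    }


def is_read_only_data_or_code(seg_type):
    """True for any code segment (bit 3 set) or a data segment whose
    writable bit (bit 1) is clear."""
    is_code = bool(seg_type & 0x8)
    is_writable_data = (not is_code) and bool(seg_type & 0x2)
    return is_code or not is_writable_data


def decode_vmread_info(info):
    """Pull out the operand-related fields of the instruction info."""
    seg = (info >> 15) & 7
    return {
        "addr_size_bits": ADDR_SIZES.get((info >> 7) & 7),  # None if reserved
        "is_register": bool(info & (1 << 10)),              # 0 = memory operand
        "segment": SEG_NAMES[seg] if seg < len(SEG_NAMES) else "reserved",
        "base_reg": (info >> 23) & 0xF,
        "base_valid": not (info & (1 << 27)),
    }


if __name__ == "__main__":
    ss = decode_access_rights(0x1C000)
    print("SS:", ss, "-> rejected for write:",
          is_read_only_data_or_code(ss["type"]))
    print("info:", decode_vmread_info(0xE2614920))
```

If the assumed layout is right, 0x1c000 decodes to type 0 with only the D/B,
G and bit 16 ("unusable") bits set, and 0xe2614920 decodes to a 64-bit memory
operand addressed through SS. A type of 0 fails the writable-data test, which
matches the "read-only segment type==0" message in the dump, whereas DS
(attr=0x0c093, type 3) passes it.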