Re: [edk2] apparent KVM problem with LRET in TianoCore S3 resume trampoline

Laszlo Ersek <lersek@xxxxxxxxxx> · Sun, 08 Dec 2013 18:43:26 +0100

On 12/06/13 13:03, Paolo Bonzini wrote:
> Il 05/12/2013 19:29, Laszlo Ersek ha scritto:
>> On 12/05/13 18:42, Paolo Bonzini wrote:
>>> Il 05/12/2013 17:12, Laszlo Ersek ha scritto:
>>>> Hi,
>>>>
>>>> I'm working on S3 suspend/resume in OVMF. The problem is that I'm getting an
>>>> unexpected guest reboot for code (LRET) that works on physical hardware. I
>>>> tried to trace the problem with ftrace, but I didn't get any mentions of
>>>> em_ret_far(). (Maybe I was looking in the wrong place.)
>>>
>>> What does ftrace say anyway?
>>
>> (pls. see in the next msg I sent)
> 
> Actually I meant the ftrace without any patches.
> 
> Thanks to your binary I now reproduced the issue and it looks like the
> 64-bit->16-bit switch works:

Thank you for spending (apparently more than a little) time on this!

> 
>  qemu-system-x86-4081  [001] 62650.335040: kvm_exit:             reason CR_ACCESS rip 0x3cf7ae45 info 0 0
>  qemu-system-x86-4081  [001] 62650.335041: kvm_cr:               cr_write 0 = 0x32
>  qemu-system-x86-4081  [001] 62650.335046: kvm_entry:            vcpu 0
> 
> 	This is the "mov %rax, %cr0". PE and PG are turned off.

I'm surprised by this result. The instruction you refer to is below
"_AsmTransferControl_al_0000" (in the original, unpatched code).

I had earlier added an infinite loop right below that label (a different
loop than my xxxx debug loop), and it was *never* reached in my test.
That is, from the lret that I reported as problematic, to the
instruction you refer to, the CPU would have had to cross (and finish)
the infinite loop that I had added earlier. And that never happened in
my test.

I had added that loop immediately at "_AsmTransferControl_al_0000"
precisely because I wanted to see whether the label is reached and the
problem lies with something below that label, or whether the problem is
the first lret. I sent my email to the KVM list after I had isolated the
problem to the first LRET:

http://thread.gmane.org/gmane.comp.bios.tianocore.devel/5297/focus=5325

On 12/04/13 19:05, Laszlo Ersek wrote:
> I tested if the (intended) target location of the LRET is reached, and
> it is not. (It's easy to test by adding a small infinite loop, moving
> it around, and seeing if the VM is spinning with or without producing
> a bunch of output on the debug port.) It's *really* that
> internally-targeted LRET that causes a reboot. [...]

I have absolutely no clue why this code executes for you and doesn't for
me :) What guest RAM size did you test with?

>  qemu-system-x86-4081  [001] 62650.335047: kvm_exit:             reason MSR_READ rip 0x3cf7ae4e info 0 0
>  qemu-system-x86-4081  [001] 62650.335048: kvm_msr:              msr_read c0000080 = 0x100
>  qemu-system-x86-4081  [001] 62650.335048: kvm_entry:            vcpu 0
>  qemu-system-x86-4081  [001] 62650.335048: kvm_exit:             reason MSR_WRITE rip 0x3cf7ae53 info 0 0
>  qemu-system-x86-4081  [001] 62650.335049: kvm_msr:              msr_write c0000080 = 0x0
>  qemu-system-x86-4081  [001] 62650.335050: kvm_entry:            vcpu 0
> 
> 	LME is turned off.
> 
>  qemu-system-x86-4081  [001] 62650.335050: kvm_exit:             reason CR_ACCESS rip 0x3cf7ae55 info 304 0
>  qemu-system-x86-4081  [001] 62650.335050: kvm_cr:               cr_write 4 = 0x640
>  qemu-system-x86-4081  [001] 62650.335053: kvm_entry:            vcpu 0
> 
> 	PAE is turned off.
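
For readers following the trace, the three annotations so far (PE/PG off,
LME off, PAE off) can be checked against the raw values. This is only a
small sketch: the bit positions are the architectural ones (CR0.PE bit 0,
CR0.PG bit 31, CR4.PAE bit 5, EFER.LME bit 8), while the helper name
flags() is made up for illustration and is not part of KVM or edk2.

```python
# Architectural bit positions (x86 control registers and EFER MSR).
CR0_PE = 1 << 0     # protected mode enable
CR0_PG = 1 << 31    # paging enable
CR4_PAE = 1 << 5    # physical address extension
EFER_LME = 1 << 8   # long mode enable (MSR 0xc0000080)


def flags(value, **masks):
    """Hypothetical helper: report which of the named bits are set."""
    return {name: bool(value & mask) for name, mask in masks.items()}


# Values copied from the kvm_cr / kvm_msr trace lines quoted above.
cr0_written = flags(0x32, PE=CR0_PE, PG=CR0_PG)
efer_read = flags(0x100, LME=EFER_LME)
efer_written = flags(0x0, LME=EFER_LME)
cr4_written = flags(0x640, PAE=CR4_PAE)

for label, decoded in [
    ("cr_write 0 = 0x32", cr0_written),
    ("msr_read c0000080 = 0x100", efer_read),
    ("msr_write c0000080 = 0x0", efer_written),
    ("cr_write 4 = 0x640", cr4_written),
]:
    print(label, decoded)
```

The decoded values agree with the annotations: the CR0 write clears both PE
and PG, EFER goes from LME set to LME clear, and the CR4 write leaves PAE
clear.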
> 
>  qemu-system-x86-4081  [001] 62650.335054: kvm_exit:             reason CR_ACCESS rip 0x11e6 info 0 0
>  qemu-system-x86-4081  [001] 62650.335054: kvm_cr:               cr_write 0 = 0x33
>  qemu-system-x86-4081  [001] 62650.335054: kvm_entry:            vcpu 0
> 
> 	Here we're already in real mode.  The weird RIP is explained by
> 	the first few bytes after the FACS resume vector:

From this point on, I think, you were debugging the Linux wakeup code in
"arch/x86/realmode/rm/wakeup_asm.S".

> 
> 		0x9a1d:0000:  cli    
> 		0x9a1d:0001:  cld    
> 		0x9a1d:0002:  ljmp   $9900,$11d7

ENTRY(wakeup_start)
        cli
        cld

        LJMPW_RM(3f)
3:
        /* Apparently some dimwit BIOS programmers don't know how to
           program a PM to RM transition, and we might end up here with
           junk in the data segment descriptor registers.  The only way
           to repair that is to go into PM and fix it ourselves... */
[...]

From Linux kernel commit 4b4f7280.
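
The match between the disassembly and LJMPW_RM(3f) can be checked with
real-mode address arithmetic (linear = segment * 16 + offset). A rough
sketch, assuming the operands in the disassembly are hexadecimal, that the
traced "rip" is the offset within CS rather than a linear address, and the
usual 16-bit encodings (cli and cld one byte each, far jmp ptr16:16 five
bytes); linear() is a name I made up:

```python
def linear(segment, offset):
    """Real-mode linear address for segment:offset (no A20 masking)."""
    return (segment << 4) + offset


# Addresses taken from the quoted disassembly.
resume_vector = linear(0x9A1D, 0x0000)   # where "cli" sits
ljmp_target = linear(0x9900, 0x11D7)     # destination of the far jump

# Assumed instruction lengths in 16-bit code: cli, cld, ljmp ptr16:16.
LEN_CLI, LEN_CLD, LEN_LJMP = 1, 1, 5
after_ljmp = resume_vector + LEN_CLI + LEN_CLD + LEN_LJMP

# The CR_ACCESS exit in the trace reported rip 0x11e6; with CS = 0x9900:
traced_exit = linear(0x9900, 0x11E6)

print(hex(resume_vector), hex(ljmp_target), hex(after_ljmp))
print(hex(traced_exit), traced_exit - ljmp_target)
```

Under these assumptions the far jump lands on the byte right after itself
(label "3:"), only with CS reloaded to 0x9900, and the traced exit is 0xf
bytes further on. That would explain the small RIP.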

> The page tables are, ahem, crap:
> 
> 000c000: 6750 fe01 0000 0000 0000 0000 0000 0000  gP..............
> 000c010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 000c020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 000c030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 000c040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 000c050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 000c060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 000c070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 000c080: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 000c090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 000c0a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 000c0b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 000c0c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 000c0d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 000c0e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 000c0f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 
> This is 0x9c000.  Strikes any bell?

We're wildly corrupting OS memory during OVMF S3 resume. That's a known
problem and the next stage for me to figure out (with Jordan's help
hopefully):

http://thread.gmane.org/gmane.comp.bios.tianocore.devel/5297/focus=5321
http://thread.gmane.org/gmane.comp.bios.tianocore.devel/5297/focus=5325

So, your tracing reached / debugged code that I had never ever reached.
And my report was precisely about not reaching it. Once we reach it,
it's expected to blow up, but first I wanted to get there.

Again, the 64-bit->16-bit switch (in the original, unpatched edk2/OVMF
code) never worked for me.

I think I did find the reason for that though, please see

http://thread.gmane.org/gmane.comp.bios.tianocore.devel/5343/focus=5365

especially the last patch attached to it.

The likely reason for the failure I was seeing is that the 16-bit code
had been relocated to way above 1MB and could not be addressed with the
16-bit CS:IP notation at all.
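
To illustrate the limit: the highest linear address that any CS:IP pair can
express in real mode is 0xFFFF:0xFFFF. A minimal sketch follows; the helper
names are made up, and 0x3cf7ae45 is simply the RIP from Paolo's trace, used
here as an example of an address far above 1MB, not as the actual location
of the relocated 16-bit code.

```python
def linear(segment, offset):
    """Real-mode linear address for segment:offset (no A20 masking)."""
    return (segment << 4) + offset


# Largest address expressible as a 16-bit segment plus a 16-bit offset.
REAL_MODE_MAX = linear(0xFFFF, 0xFFFF)


def addressable_in_real_mode(addr):
    """True if some 16-bit CS:IP pair can name this linear address."""
    return 0 <= addr <= REAL_MODE_MAX


print(hex(REAL_MODE_MAX))
print(addressable_in_real_mode(0x9A1D0))      # FACS resume vector area
print(addressable_in_real_mode(0x3CF7AE45))   # example address above 1MB
```

So the limit is 0x10FFEF, just under 1MB + 64KB, and code placed where the
64-bit part of the trampoline was running is out of reach for a 16-bit far
return.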

Thanks!
Laszlo