Re: [PATCH 26/26] KVM: PPC: Add Documentation about PV interface

Alexander Graf <agraf@xxxxxxx> · Mon, 28 Jun 2010 09:49:33 +0200

On 28.06.2010, at 09:18, Milton Miller wrote:

> On Sun Jun 27 around 19:33:52 EST 2010 Alexander Graf wrote:
>> On 27.06.2010 at 10:14, Avi Kivity <avi at redhat.com> wrote:
>>> On 06/26/2010 02:25 AM, Alexander Graf wrote:
> 
>>>> +
>>>> +PPC hypercalls
>>>> +==============
>>>> +
>>>> +The only viable ways to reliably get from guest context to host context are:
>>>> +
>>>> +    1) Call an invalid instruction
>>>> +    2) Call the "sc" instruction with a parameter to "sc"
>>>> +    3) Call the "sc" instruction with parameters in GPRs
>>>> +
>>>> +Method 1 is always a bad idea. Invalid instructions can be replaced later on
>>>> +by valid instructions, rendering the interface broken.
>>>> +
>>>> +Method 2 also has downsides. If the parameter to "sc" is != 0, the spec is
>>>> +rather unclear on whether the sc is targeted directly at the hypervisor or the
>>>> +supervisor. It would also require that we read the syscall-issuing instruction
>>>> +every time a syscall is issued, slowing down guest syscalls.
>>>> +
> 
> It goes to the hypervisor, and it would require the hypervisor to
> return to the supervisor, but I believe it just returns to the user with
> permission denied.

That's what I assumed, yeah :(.

> 
>>>> +Method 3 is what KVM uses. We pass magic constants (KVM_SC_MAGIC_R3 and
>>>> +KVM_SC_MAGIC_R4) in r3 and r4 respectively. If a syscall instruction with these
>>>> +magic values arrives from the guest's kernel mode, we take the syscall as a
>>>> +hypercall.
>>>> 
>>> 
>>> Is there any chance a normal syscall will have those values in r3 and r4?
>> 
>> r3 is the syscall number. So as long as the guest doesn't reuse that  
>> value, we're safe. Since in general syscall numbers are not randomly  
>> scattered throughout the number range, we should be ok here.
>> 
> 
> No, r0 has the system call number.  Registers 3 and 4 are the first
> 2 args in the C ABI (or the first 64-bit arg in the 32-bit C ABI), but the
> Linux syscall ABI is special.  (In addition, it returns success or failure
> in cr0.)

Oh. Ahem :)
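
For reference, a minimal sketch of the check as the patch text describes it: an sc is taken as a hypercall only if it comes from guest kernel mode (PR=0) and r3/r4 carry the magic values. The struct, the function name and the two constant values below are placeholders for illustration, not the values from the patch. Note that r0, which carries the Linux syscall number, plays no part in this check.

```c
#include <stdbool.h>
#include <stdint.h>

/* MSR[PR]: set while the CPU runs in problem state (user mode). */
#define MSR_PR (UINT64_C(1) << 14)

/* Placeholder values only; the real constants are defined by the patch. */
#define KVM_SC_MAGIC_R3 UINT64_C(0x11111111)
#define KVM_SC_MAGIC_R4 UINT64_C(0x22222222)

/* Hypothetical snapshot of the guest state at the time of the sc. */
struct sc_regs {
    uint64_t msr;      /* guest-visible MSR */
    uint64_t gpr[32];  /* r0..r31 */
};

/* True if this sc should be taken as a hypercall rather than
 * reflected back into the guest as an ordinary syscall. */
static bool sc_is_hypercall(const struct sc_regs *regs)
{
    if (regs->msr & MSR_PR)
        return false;  /* issued from guest user mode */

    return regs->gpr[3] == KVM_SC_MAGIC_R3 &&
           regs->gpr[4] == KVM_SC_MAGIC_R4;
}
```

Since r3 and r4 are plain argument registers in the Linux syscall ABI, an ordinary syscall could in principle carry the same two values; the check only makes such a collision unlikely.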

> 
>>> 
>>> If so, maybe it's better to use pc as the key for hypercalls.  Let
>>> the guest designate one instruction address as the hypercall call
>>> point; kvm can easily check it and reflect it back to the guest if
>>> it doesn't match.
>>> 
>> 
>> You mean the guest would tell the hv where the hypercall lies? That  
>> would require a hypercall, no? Defining it statically is tricky. I  
>> want to PV'nize osx using a kernel module later, so I don't have  
>> control over the physical layout.
>> 
>>> Is it valid and useful to issue sc from privileged mode anyway,  
>>> except for calling the hypervisor?
>> 
>> Same as a syscall on x86 really. The kernel can and does issue  
>> syscalls within itself.
>> 
>> 
> 
> I don't believe we support the kernel actually doing a syscall to itself
> anymore, at least on powerpc.  The callers call the underlying system
> call function, or kernel_thread.
> 
> That said, I would suggest we allocate a syscall number for this, as it
> would document the usage.  (In addition to 0..nr_syscalls - 1 we have
> 0x1ebe in use).

That's actually a pretty good idea.
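
A rough sketch of what that could look like, assuming the hypercall is keyed on a dedicated number in r0. Both EXAMPLE_NR_SYSCALLS and EXAMPLE_KVM_SC_NR are made-up values for illustration; only 0x1ebe comes from the mail above.

```c
#include <stdbool.h>
#include <stdint.h>

/* MSR[PR]: set while the CPU runs in problem state (user mode). */
#define MSR_PR (UINT64_C(1) << 14)

/* Hypothetical size of the regular syscall table (0..nr_syscalls - 1). */
#define EXAMPLE_NR_SYSCALLS UINT64_C(400)
/* Number mentioned above as already being in use. */
#define SC_NR_IN_USE_1EBE   UINT64_C(0x1ebe)
/* Hypothetical number reserved for the KVM hypercall. */
#define EXAMPLE_KVM_SC_NR   UINT64_C(0x4b56)

/* True if nr collides with neither the regular table nor 0x1ebe. */
static bool sc_nr_is_unclaimed(uint64_t nr)
{
    return nr >= EXAMPLE_NR_SYSCALLS && nr != SC_NR_IN_USE_1EBE;
}

/* Hypercall check keyed on the syscall number in r0 instead of r3/r4. */
static bool sc_is_hypercall_by_nr(uint64_t msr, uint64_t r0)
{
    return !(msr & MSR_PR) && r0 == EXAMPLE_KVM_SC_NR;
}
```

With a reserved number, a regular syscall can no longer be mistaken for a hypercall, whatever its arguments are.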

> 
> Also, is there any desire to nest such emulation?

Nesting should just work, right? We only accept hypercalls from PR=0, and guests run in PR=1, so the sc interrupt ends up in the L1 guest.
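
As a toy model of that argument (not actual KVM code, all names are made up): each level applies the same rule to the MSR it sees for its guest. The host sees the nested guest with PR=1 and reflects the sc; the L1 guest then sees its own guest's virtual MSR with PR=0 and takes the hypercall.

```c
#include <stdbool.h>
#include <stdint.h>

/* MSR[PR]: set while the CPU runs in problem state (user mode). */
#define MSR_PR (UINT64_C(1) << 14)

enum sc_action {
    SC_REFLECT_TO_GUEST,   /* deliver as a syscall interrupt to the guest */
    SC_TAKE_AS_HYPERCALL   /* handle at this level */
};

/* The rule every hypervisor level applies to an sc from its guest.
 * guest_msr is the MSR as this level sees it for the guest;
 * magic_matches says whether the registers identify a hypercall. */
static enum sc_action handle_sc(uint64_t guest_msr, bool magic_matches)
{
    if (!(guest_msr & MSR_PR) && magic_matches)
        return SC_TAKE_AS_HYPERCALL;
    return SC_REFLECT_TO_GUEST;
}
```

For a hypercall issued by a nested guest kernel, the host would evaluate handle_sc(MSR_PR, true) and reflect, and the L1 guest would evaluate handle_sc(0, true) and handle it.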

The only issue I'm aware of that completely breaks nested KVM on PPC is the MSR_IR != MSR_DR logic. For certain interrupts, the world switch handler fetches the instruction that caused the interrupt by keeping MSR_IR=0 but setting MSR_DR=1. And KVM speeds up MSR_DR != MSR_IR by mapping both of them lazily in a special address space. So if you access the same page as instruction and as data, you get an invalid result.

Alex
