Re: RFC: New API for PPC for vcpu mmu access

Alexander Graf <agraf@xxxxxxx> · Thu, 3 Feb 2011 10:19:06 +0100

On 02.02.2011, at 23:08, Scott Wood wrote:

> On Wed, 2 Feb 2011 22:33:41 +0100
> Alexander Graf <agraf@xxxxxxx> wrote:
> 
>> 
>> On 02.02.2011, at 21:33, Yoder Stuart-B08248 wrote:
>> 
>>> Below is a proposal for a new API for PPC to allow KVM clients
>>> to set MMU state in a vcpu.
>>> 
>>> BookE processors have one or more software managed TLBs and
>>> currently there is no mechanism for Qemu to initialize
>>> or access them.  This is needed for normal initialization
>>> as well as debug.
>>> 
>>> There are 4 APIs:
>>> 
>>> -KVM_PPC_SET_MMU_TYPE allows the client to negotiate the type
>>> of MMU with KVM-- the type determines the size and format
>>> of the data in the other APIs
>> 
>> This should be done through the PVR hint in sregs, no? Usually a single CPU type only has a single MMU type.
> 
> Well, for one, we don't have sregs or a PVR hint on Book E yet. :-)

Ah, right. The BookE code just passes its host PVR to the guest. :)

> But also, there could be differing levels of support -- e.g. on e500mc,
> we have no plans to support exposing the hardware virtualization
> features in a nested manner (nor am I sure that it's reasonably
> possible).  But if someone does it, that would be a change in the
> interface between Qemu and KVM to allow the extra fields to be set,
> with no change in PVR.
> 
> Likewise, a new chip could introduce new capabilities, but still be
> capable of working the old way.
> 
> Plus, basing it on PVR means Qemu needs to be updated every time
> there's a new chip with a new PVR.

Ok, convinced. We need a way to choose an mmu model.

> 
>>> -KVM_PPC_INVALIDATE_TLB invalidates all TLB entries in all
>>> TLBs in the vcpu
>>> 
>>> -KVM_PPC_SET_TLBE sets a TLB entry-- the Power architecture
>>> specifies the format of the MMU data passed in
>> 
>> This seems too fine-grained. I'd prefer a list of all TLB entries to be pushed in either direction. What's the foreseeable number of TLB entries within the next 10 years?
> 
> I have no idea what things will look like 10 years down the road, but
> currently e500mc has 576 entries (512 TLB0, 64 TLB1).

With 64-byte entries that sums up to 64 * 576 bytes, which is 36 KB. Ouch. Certainly nothing we want to transfer every time qemu feels like resolving an EA.
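
For reference, a rough sketch of where that number comes from, assuming the generic 64-byte entry (eight 64-bit words) floated in my earlier mail and the e500mc counts quoted above. The macro and function names are made up for illustration, and uint64_t stands in for the __u64 a kernel header would use:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
#define E500MC_TLB0_ENTRIES 512
#define E500MC_TLB1_ENTRIES 64

/* Generic entry as floated in this thread: 8 x 64 bit = 64 bytes. */
struct kvm_ppc_booke_tlbe {
    uint64_t data[8];   /* would be __u64 in a kernel header */
};

/* Bytes needed to transfer every guest TLB entry in one go. */
static size_t full_tlb_bytes(void)
{
    size_t entries = E500MC_TLB0_ENTRIES + E500MC_TLB1_ENTRIES; /* 576 */

    return entries * sizeof(struct kvm_ppc_booke_tlbe);
}
```

That is 36864 bytes per full transfer, before any ioctl overhead.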

> 
>> Having the whole stack available would make the sync with qemu easier and also allows us to only do a single ioctl for all the TLB management. Thanks to the PVR we know the size of the TLB, so we don't have to shove that around.
> 
> No, we don't know the size (or necessarily even the structure) of the
> TLB.  KVM may provide a guest TLB that is larger than what hardware has,
> as a cache to reduce the number of TLB misses that have to go to the
> guest (we do this now in another hypervisor).
> 
> Plus sometimes it's just simpler -- why bother halving the size of the
> guest TLB when running on e500v1?

Makes sense. So we basically need an ioctl that tells KVM the MMU type and TLB size. Remember, the userspace tool is the place for policies :). Maybe this even needs to be switchable at runtime, in case you boot with u-boot in the guest, load a kernel, and the kernel then activates some PV extensions.
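
To make that a bit more concrete, here is a minimal sketch of what the parameter block for such an ioctl might carry. Everything in it is hypothetical: the struct, the enum values and the helper do not exist in any KVM header, and the limit of four TLB arrays is an arbitrary assumption. It only shows the idea of userspace stating both the MMU type and the geometry it wants:

```c
#include <stdint.h>

/* Everything here is hypothetical; none of it exists in the KVM headers. */
enum mmu_type_sketch {
    MMU_SKETCH_E500   = 1,
    MMU_SKETCH_E500MC = 2,
};

#define MMU_SKETCH_MAX_TLBS 4   /* arbitrary limit for this sketch */

struct mmu_config_sketch {
    uint32_t mmu_type;                          /* selects the tlbe layout */
    uint32_t num_tlbs;                          /* number of TLB arrays in use */
    uint32_t tlb_entries[MMU_SKETCH_MAX_TLBS];  /* entries per TLB array */
};

/* Returns the total entry count, or -1 if the request is malformed. */
static long long mmu_config_total_entries(const struct mmu_config_sketch *cfg)
{
    long long total = 0;

    if (cfg->num_tlbs == 0 || cfg->num_tlbs > MMU_SKETCH_MAX_TLBS)
        return -1;

    for (uint32_t i = 0; i < cfg->num_tlbs; i++) {
        if (cfg->tlb_entries[i] == 0)
            return -1;
        total += cfg->tlb_entries[i];
    }
    return total;
}
```

With mmu_type = MMU_SKETCH_E500MC, num_tlbs = 2 and tlb_entries = { 512, 64 } this describes the 576-entry e500mc case from above. A guest TLB larger than the hardware one would just be a different set of numbers.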

> 
>>> KVM_PPC_INVALIDATE_TLB
>>> ----------------------
>>> 
>>> Capability: KVM_CAP_PPC_MMU
>>> Architectures: powerpc
>>> Type: vcpu ioctl
>>> Parameters: none
>>> Returns: 0 on success, -1 on error
>>> 
>>> Invalidates all TLB entries in all TLBs of the vcpu.
>> 
>> The only reason we need to do this is because there's no proper reset function in qemu for the e500 tlb. I'd prefer to have that there and push the TLB contents down on reset.
> 
> The other way to look at it is that there's no need for a reset
> function if all the state is properly settable. :-)

You make it sound as if it was hard to implement a reset function in qemu :). Really, that's where it belongs.

> 
> Which we want anyway for debugging (and migration, though I wonder if
> anyone would actually use that with embedded hardware).

We certainly should not close the door on migration either way. So all the state has to be 100% retrievable from user space.

> 
>> Haven't fully made up my mind on the tlb entry structure yet. Maybe something like
>> 
>> struct kvm_ppc_booke_tlbe {
>>    __u64 data[8];
>> };
>> 
>> would be enough? The rest is implementation dependent anyways. Exposing those details to user space doesn't buy us anything. By keeping it generic we can at least still build against older kernel headers :).
> 
> If it's not exposed to userspace, how is userspace going to
> interpret/fill in the data?

It can cast the data to an overlay struct according to the MMU type. So userspace still has to know the layout of the tlbe, but it doesn't have to be defined with a huge amount of anonymous unions. An alternative would be to explicitly define each mmu type's entries:

struct kvm_ppc_booke_tlbe {
    union {
        struct {
            ...
        } tlbe_e500;
        struct {
            ...
        } tlbe_e500mc;
        struct {
            ...
        } tlbe_e500mc_hv;
        __u64 pad[x];
    };
};
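
For the first variant (opaque data plus an overlay in userspace), a minimal sketch of what the userspace side could look like. The field layout of tlbe_e500_sketch is invented purely to show the mechanism and is not the real e500 entry format; the helper names are hypothetical too. It copies with memcpy rather than casting pointers, to stay clear of aliasing problems:

```c
#include <stdint.h>
#include <string.h>

/* Opaque entry as seen by the generic interface. */
struct kvm_ppc_booke_tlbe {
    uint64_t data[8];   /* would be __u64 in a kernel header */
};

/* Invented layout, only to show the overlay; not the real e500 format. */
struct tlbe_e500_sketch {
    uint32_t mas1;
    uint32_t mas2;
    uint32_t mas3;
    uint32_t mas7;
};

/* The overlay must fit into the opaque entry. */
_Static_assert(sizeof(struct tlbe_e500_sketch) <=
               sizeof(struct kvm_ppc_booke_tlbe),
               "overlay larger than generic tlbe");

/* Fill an opaque entry from the MMU-specific view; unused bytes are zeroed. */
static void tlbe_pack_e500(struct kvm_ppc_booke_tlbe *dst,
                           const struct tlbe_e500_sketch *src)
{
    memset(dst, 0, sizeof(*dst));
    memcpy(dst->data, src, sizeof(*src));
}

/* Read the MMU-specific view back out of an opaque entry. */
static void tlbe_unpack_e500(struct tlbe_e500_sketch *dst,
                             const struct kvm_ppc_booke_tlbe *src)
{
    memcpy(dst, src->data, sizeof(*dst));
}
```

The kernel header then only needs the opaque struct, and each MMU type's layout can live wherever userspace wants it.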

> As for kernel headers, I think qemu needs to provide its own copy, like
> qemu-kvm does, and like http://kernelnewbies.org/KernelHeaders suggests
> for programs which rely on recent kernel APIs (which Qemu+KVM tends
> to do already).

Yeah, tedious story...

> 
>> Userspace should only really need the TLB entries for
>> 
>>  1) Debugging
>>  2) Migration
>> 
>> So I don't see the point in making the interface optimized for single TLB entries. Do you have other use cases in mind?
> 
> The third case is reset/init, which can be performance sensitive
> (especially in failover setups).

This is an acceleration. The generic approach needs to come first (generic set of the full TLB). Then we can measure if it really does take too long and add another flush call.

> And debugging can require single translations, and can be a
> performance issue if you need to toss around several kilobytes of data
> per translation, and a debugger is doing e.g. an automated pattern of
> single step plus inspect memory.

Yeah, that one's tricky. Usually the way the memory resolver in qemu works is as follows:

 * kvm goes to qemu
 * qemu fetches all mmu and register data from kvm
 * qemu runs its mmu resolution function as if the target was emulated

So the "normal" way would be to fetch _all_ TLB entries from KVM, shove them into env and implement the MMU in qemu (at least enough of it to enable debugging). No other target modifies this code path. But no other target needs to copy > 30kb of data only to get the mmu data either :).
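
The last step, resolving in qemu over the fetched entries, can be sketched roughly as below. This is not qemu's actual code: the sw_tlbe struct is a simplified, already decoded view with hypothetical names, and a real BookE lookup also has to match TID and address space and check permissions. It only shows why the whole array has to be in userspace before a single EA can be resolved this way:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, decoded view of one entry (hypothetical, for illustration). */
struct sw_tlbe {
    bool     valid;
    uint64_t epn;    /* effective base address, aligned to size */
    uint64_t rpn;    /* real base address, aligned to size */
    uint64_t size;   /* page size in bytes, power of two */
};

/*
 * Linear search over all fetched entries.
 * Returns true and fills *ra on a hit, false on a miss.
 */
static bool sw_tlb_translate(const struct sw_tlbe *tlb, size_t n,
                             uint64_t ea, uint64_t *ra)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t mask = tlb[i].size - 1;

        if (tlb[i].valid && (ea & ~mask) == tlb[i].epn) {
            *ra = tlb[i].rpn | (ea & mask);
            return true;
        }
    }
    return false;
}
```

The lookup itself is cheap. The expensive part is getting all 576 entries across the ioctl boundary before it can run.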

So I guess we need both. We need a full get call to facilitate migration and savevm and we can then accelerate it using a direct lookup call. Here too, I'd prefer to see the generic one first. But I do agree that it's a lot of data with high frequency, so it might make sense to expose both on the same CAP.

Alex
