Re: RFC: New API for PPC for vcpu mmu access

Alexander Graf <agraf@xxxxxxx> · Mon, 7 Feb 2011 16:43:02 +0100

On 04.02.2011, at 23:33, Scott Wood wrote:

> On Thu, 3 Feb 2011 10:19:06 +0100
> Alexander Graf <agraf@xxxxxxx> wrote:
> 
>> On 02.02.2011, at 23:08, Scott Wood wrote:
>>> On Wed, 2 Feb 2011 22:33:41 +0100
>>> Alexander Graf <agraf@xxxxxxx> wrote:
>>>> This seems to fine-grained. I'd prefer a list of all TLB entries to be pushed in either direction. What's the foreseeable number of TLB entries within the next 10 years?
>>> 
>>> I have no idea what things will look like 10 years down the road, but
>>> currently e500mc has 576 entries (512 TLB0, 64 TLB1).
>> 
>> That sums up to 64 * 576 bytes, which is 36kb. Ouch. Certainly nothing we want to transfer every time qemu feels like resolving an EA.
> 
> And that's only with the standard hardware TLB size.  On Topaz (our
> standalone hypervisor) we increased the guest's TLB0 to 16384 entries.
> It speeds up some workloads nicely, but invalidation-heavy loads get
> hurt.

Yup - I do a similar trick for Book3S. Just that the TLB is implementation dependent anyways and mostly hidden from the OS :).

> 
>>>> Having the whole stack available would make the sync with qemu easier and also allows us to only do a single ioctl for all the TLB management. Thanks to the PVR we know the size of the TLB, so we don't have to shove that around.
>>> 
>>> No, we don't know the size (or necessarily even the structure) of the
>>> TLB.  KVM may provide a guest TLB that is larger than what hardware has,
>>> as a cache to reduce the number of TLB misses that have to go to the
>>> guest (we do this now in another hypervisor).
>>> 
>>> Plus sometimes it's just simpler -- why bother halving the size of the
>>> guest TLB when running on e500v1?
>> 
>> Makes sense. So we basically need an ioctl that tells KVM the MMU type and TLB size. Remember, the userspace tool is the place for policies :).
> 
> Maybe, though keeping it in KVM means we can change it whenever we want
> without having to sync up Qemu and worry about backward compatibility.

Quite the contrary - you have to worry more about backward compatibility. If we implement a new feature that doesn't work on old kernels, we can just tell qemu to not work on those old versions. For the kernel interfaces, we have to keep supporting old userspace.

> Same-as-hardware TLB geometry with a Qemu-specified number of sets is
> probably good enough for the forseeable future, though.  There were
> some other schemes we considered a while back for Topaz, but we ended
> up just going with a larger version of what's in hardware.

Sounds good. Not sure how we'd handle page tables in this whole scheme yet, but I guess we really should take one step at a time.

> 
>> Maybe this even needs to be potentially runtime switchable, in case
>> you boot off with u-boot in the guest, load a kernel and the kernel
>> activates some PV extensions.
> 
> U-Boot should be OK with it -- the increased TLB size is
> architecturally valid, and U-boot doesn't do much with TLB0 anyway.
> 
> If we later have true PV extensions such as a page table, that'd be
> another matter.

Yeah. Or imagine we realize that we should make the TLB size dynamic, so that non-PV aware guest OSs get the exact same behavior as on real hardware (because they're buggy for example) while PV aware guests can bump up the TLB size. But really, let's worry about that later.

> 
>>>> The only reason we need to do this is because there's no proper reset function in qemu for the e500 tlb. I'd prefer to have that there and push the TLB contents down on reset.
>>> 
>>> The other way to look at it is that there's no need for a reset
>>> function if all the state is properly settable. :-)
>> 
>> You make it sound as if it was hard to implement a reset function in qemu :). Really, that's where it belongs.
> 
> Sorry, I misread "reset function in qemu" as "reset ioctl in KVM".
> 
> This is meant to be used by a qemu reset function.  If there's a 
> full-tlb set, that could be used instead, though it'd be slower.  With
> the API as proposed it's needed to clear the slate before you set the
> individual entries you want.

Yeah, let's just go for a full TLB get/set and optimize specific slow parts later. It's easier to fine-tune stuff that's slow than to miss out on functionality because we were trying to be too clever from the start.

> 
>>> Which we want anyway for debugging (and migration, though I wonder if
>>> anyone would actually use that with embedded hardware).
>> 
>> We certainly should not close the door on migration either way. So all the state has to be 100% user space receivable.
> 
> Oh, I agree -- or I wouldn't have even mentioned it. :-)
> 
> I just wouldn't make it the primary optimization concern at this point.

No, not at all - no optimization for it. But keeping the door open :).

> 
>>>> Haven't fully made up my mind on the tlb entry structure yet. Maybe something like
>>>> 
>>>> struct kvm_ppc_booke_tlbe {
>>>>   __u64 data[8];
>>>> };
>>>> 
>>>> would be enough? The rest is implementation dependent anyways. Exposing those details to user space doesn't buy us anything. By keeping it generic we can at least still build against older kernel headers :).
>>> 
>>> If it's not exposed to userspace, how is userspace going to
>>> interpret/fill in the data?
>> 
>> It can overlay cast according to the MMU type.
> 
> How's that different from backing the void pointer up with a different
> struct depending on the MMU type?  We weren't proposing unions.
> 
> A fixed array does mean you wouldn't have to worry about whether qemu
> supports the more advanced struct format if fields are added --
> you can just unconditionally write it, as long as it's backwards
> compatible.  Unless you hit the limit of the pre-determined array size,
> that is.  And if that gets made higher for future expansion, that's
> even more data that has to get transferred, before it's really needed.

Yes, it is. And I don't see how we could easily avoid it. Maybe just pass in a random __user pointer that we directly write to from kernel space and tell qemu how big and what type a tlb entry is?

struct request_ppc_tlb {
    int tlb_type;
    int tlb_entries;
    uint64_t __user *tlb_data
};

in qemu:

struct request_ppc_tlb req;
reg.tlb_data = qemu_malloc(PPC_TLB_SIZE_MAX);
r = do_ioctl(REQUEST_PPC_TLB, &req);
if (r == -ENOMEM) {
    cpu_abort(env, "TLB too big");
}

switch (reg.tlb_type) {
case PPC_TLB_xxx:
    copy_reg_to_tlb_for_xxx(env, reg.tlb_data);
}

something like this. Then we should be flexible enough for the foreseeable future and make it possible for kernel space to switch MMU modes in case we need that.

> 
>>> As for kernel headers, I think qemu needs to provide its own copy, like
>>> qemu-kvm does, and like http://kernelnewbies.org/KernelHeaders suggests
>>> for programs which rely on recent kernel APIs (which Qemu+KVM tends
>>> to do already).
>> 
>> Yeah, tedious story...
> 
> I guess it's come up before?

It has and has never been really resolved I think. Hits me every other day on book3s.

> 
>>>> Userspace should only really need the TLB entries for
>>>> 
>>>> 1) Debugging
>>>> 2) Migration
>>>> 
>>>> So I don't see the point in making the interface optimized for single TLB entries. Do you have other use cases in mind?
>>> 
>>> The third case is reset/init, which can be performance sensitive
>>> (especially in failover setups).
>> 
>> This is an acceleration. The generic approach needs to come first (generic set of the full TLB). Then we can measure if it really does take too long and add another flush call.
> 
> The API as proposed can do a full TLB set (if you start with
> invalidate), and a full TLB get (by iterating).  So it's an
> optimization decision in either direction.

Would you really want to loop through 16k entries, doing an ioctl for each? Then performance really would always be an issue.

> If we do decide to mandate a standard geometry TLB, just with
> settable size, then doing a full get/set has a simplicity advantage.
> The iteration approach was intended to preserve flexibility of
> implementation.  And then for optimization, we could add an interface
> to get/set a single entry, that doesn't need to support iteration.

It's more than simplicity. It's also speed because you save the individual ioctl on each. I would really prefer we tackle it with a full-on tlb get/set first and then put the very flexible one on top, because to be the full-on approach feels like the more generic one. I'm very open to adding an individual tlb get/set and maybe even a "kvm, please translate EA x to RA y" ioctl. But those should come after we cover the big hammer that just copies everything.

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html