Re: RFC: New API for PPC for vcpu mmu access

Scott Wood <scottwood@xxxxxxxxxxxxx> · Fri, 4 Feb 2011 16:33:38 -0600

On Thu, 3 Feb 2011 10:19:06 +0100
Alexander Graf <agraf@xxxxxxx> wrote:

> On 02.02.2011, at 23:08, Scott Wood wrote:
> > On Wed, 2 Feb 2011 22:33:41 +0100
> > Alexander Graf <agraf@xxxxxxx> wrote:
> >> This seems too fine-grained. I'd prefer a list of all TLB entries to be pushed in either direction. What's the foreseeable number of TLB entries within the next 10 years?
> > 
> > I have no idea what things will look like 10 years down the road, but
> > currently e500mc has 576 entries (512 TLB0, 64 TLB1).
> 
> That sums up to 64 * 576 bytes, which is 36 KB. Ouch. Certainly nothing we want to transfer every time qemu feels like resolving an EA.

And that's only with the standard hardware TLB size.  On Topaz (our
standalone hypervisor) we increased the guest's TLB0 to 16384 entries.
It speeds up some workloads nicely, but invalidation-heavy loads get
hurt.
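
To put numbers on the sizes discussed above, here is a minimal sketch of
the arithmetic.  It assumes the 64-byte entry floated elsewhere in this
thread (eight 64-bit words); tlb_transfer_bytes() is a hypothetical
helper, not part of any proposed API, and keeping TLB1 at 64 entries in
the enlarged case is an assumption made only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Entry layout floated in this thread: eight 64-bit words. */
struct kvm_ppc_booke_tlbe {
    uint64_t data[8];
};

static_assert(sizeof(struct kvm_ppc_booke_tlbe) == 64,
              "sketch assumes a 64-byte entry");

/* Hypothetical helper: bytes moved by one full-TLB get or set. */
static size_t tlb_transfer_bytes(size_t tlb0_entries, size_t tlb1_entries)
{
    return (tlb0_entries + tlb1_entries) * sizeof(struct kvm_ppc_booke_tlbe);
}
```

With the standard e500mc geometry (512 + 64 entries) this gives 36864
bytes.  With a 16384-entry TLB0 and an assumed 64-entry TLB1 it is
1052672 bytes, roughly 1 MB per transfer.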

> >> Having the whole stack available would make the sync with qemu easier and also allows us to only do a single ioctl for all the TLB management. Thanks to the PVR we know the size of the TLB, so we don't have to shove that around.
> > 
> > No, we don't know the size (or necessarily even the structure) of the
> > TLB.  KVM may provide a guest TLB that is larger than what hardware has,
> > as a cache to reduce the number of TLB misses that have to go to the
> > guest (we do this now in another hypervisor).
> > 
> > Plus sometimes it's just simpler -- why bother halving the size of the
> > guest TLB when running on e500v1?
> 
> Makes sense. So we basically need an ioctl that tells KVM the MMU type and TLB size. Remember, the userspace tool is the place for policies :).

Maybe, though keeping it in KVM means we can change it whenever we want
without having to sync up Qemu and worry about backward compatibility.

Same-as-hardware TLB geometry with a Qemu-specified number of sets is
probably good enough for the foreseeable future, though.  There were
some other schemes we considered a while back for Topaz, but we ended
up just going with a larger version of what's in hardware.

> Maybe this even needs to be potentially runtime switchable, in case
> you boot off with u-boot in the guest, load a kernel and the kernel
> activates some PV extensions.

U-Boot should be OK with it -- the increased TLB size is
architecturally valid, and U-Boot doesn't do much with TLB0 anyway.

If we later have true PV extensions such as a page table, that'd be
another matter.

> >> The only reason we need to do this is because there's no proper reset function in qemu for the e500 tlb. I'd prefer to have that there and push the TLB contents down on reset.
> > 
> > The other way to look at it is that there's no need for a reset
> > function if all the state is properly settable. :-)
> 
> You make it sound as if it was hard to implement a reset function in qemu :). Really, that's where it belongs.

Sorry, I misread "reset function in qemu" as "reset ioctl in KVM".

This is meant to be used by a qemu reset function.  If there's a
full-TLB set, that could be used instead, though it'd be slower.  With
the API as proposed it's needed to clear the slate before you set the
individual entries you want.

> > Which we want anyway for debugging (and migration, though I wonder if
> > anyone would actually use that with embedded hardware).
> 
> We certainly should not close the door on migration either way. So all the state has to be 100% user space receivable.

Oh, I agree -- or I wouldn't have even mentioned it. :-)

I just wouldn't make it the primary optimization concern at this point.

> >> Haven't fully made up my mind on the tlb entry structure yet. Maybe something like
> >> 
> >> struct kvm_ppc_booke_tlbe {
> >>    __u64 data[8];
> >> };
> >> 
> >> would be enough? The rest is implementation dependent anyways. Exposing those details to user space doesn't buy us anything. By keeping it generic we can at least still build against older kernel headers :).
> > 
> > If it's not exposed to userspace, how is userspace going to
> > interpret/fill in the data?
> 
> It can overlay cast according to the MMU type.

How's that different from backing the void pointer up with a different
struct depending on the MMU type?  We weren't proposing unions.

A fixed array does mean you wouldn't have to worry about whether qemu
supports the more advanced struct format if fields are added --
you can just unconditionally write it, as long as it's backwards
compatible.  Unless you hit the limit of the pre-determined array size,
that is.  And if that gets made higher for future expansion, that's
even more data that has to get transferred, before it's really needed.
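
For concreteness, a minimal sketch of the fixed-array variant with an
MMU-specific overlay might look like the following.  The struct
e500_tlbe_view, its field selection and order, and the pack/unpack
helpers are all hypothetical, chosen only to illustrate the idea; they
are not a proposed layout.  memcpy is used instead of a pointer cast to
stay clear of aliasing problems.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Generic, fixed-size entry as suggested in the quoted text. */
struct kvm_ppc_booke_tlbe {
    uint64_t data[8];
};

/* Hypothetical e500 view of an entry; fields are illustrative only. */
struct e500_tlbe_view {
    uint32_t mas1;
    uint32_t mas2;
    uint32_t mas3;
    uint32_t mas7;
};

/* The limit mentioned above: the view must fit in the fixed array. */
static_assert(sizeof(struct e500_tlbe_view) <=
              sizeof(struct kvm_ppc_booke_tlbe),
              "MMU-specific view exceeds the generic entry size");

/* Fill a generic entry from the e500 view; unused space is zeroed. */
static void e500_tlbe_pack(struct kvm_ppc_booke_tlbe *out,
                           const struct e500_tlbe_view *in)
{
    memset(out, 0, sizeof(*out));
    memcpy(out, in, sizeof(*in));
}

/* Interpret a generic entry as the e500 view. */
static void e500_tlbe_unpack(struct e500_tlbe_view *out,
                             const struct kvm_ppc_booke_tlbe *in)
{
    memcpy(out, in, sizeof(*out));
}
```

Fields added later would land in the zeroed tail, which is what lets
older userspace write the entry unconditionally, up to the 64-byte
limit.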

> > As for kernel headers, I think qemu needs to provide its own copy, like
> > qemu-kvm does, and like http://kernelnewbies.org/KernelHeaders suggests
> > for programs which rely on recent kernel APIs (which Qemu+KVM tends
> > to do already).
> 
> Yeah, tedious story...

I guess it's come up before?

> >> Userspace should only really need the TLB entries for
> >> 
> >>  1) Debugging
> >>  2) Migration
> >> 
> >> So I don't see the point in making the interface optimized for single TLB entries. Do you have other use cases in mind?
> > 
> > The third case is reset/init, which can be performance sensitive
> > (especially in failover setups).
> 
> This is an acceleration. The generic approach needs to come first (generic set of the full TLB). Then we can measure if it really does take too long and add another flush call.

The API as proposed can do a full TLB set (if you start with
invalidate), and a full TLB get (by iterating).  So it's an
optimization decision in either direction.
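
To illustrate, here is a minimal userspace sketch of how a full set and
a full get fall out of invalidate, per-entry set, and iteration.  It
models the TLB as a plain array; every name in it (mock_tlb,
mock_tlb_set, mock_tlb_next, and so on) is hypothetical and does not
correspond to the actual ioctls in the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOCK_TLB_ENTRIES 8

struct mock_tlbe {
    uint32_t valid;
    uint64_t ea;    /* effective address */
    uint64_t pa;    /* physical address */
};

struct mock_tlb {
    struct mock_tlbe e[MOCK_TLB_ENTRIES];
};

/* Clear the slate: mark every entry invalid. */
static void mock_tlb_invalidate_all(struct mock_tlb *t)
{
    memset(t, 0, sizeof(*t));
}

/* Set one entry; returns 0 on success, -1 if idx is out of range. */
static int mock_tlb_set(struct mock_tlb *t, size_t idx,
                        uint64_t ea, uint64_t pa)
{
    if (idx >= MOCK_TLB_ENTRIES)
        return -1;
    t->e[idx] = (struct mock_tlbe){ .valid = 1, .ea = ea, .pa = pa };
    return 0;
}

/* Copy the next valid entry at or after *cursor into *out and advance
 * the cursor.  Returns the entry's index, or -1 when none remain. */
static int mock_tlb_next(const struct mock_tlb *t, size_t *cursor,
                         struct mock_tlbe *out)
{
    assert(t && cursor && out);
    while (*cursor < MOCK_TLB_ENTRIES) {
        size_t i = (*cursor)++;
        if (t->e[i].valid) {
            *out = t->e[i];
            return (int)i;
        }
    }
    return -1;
}

/* Full get built on iteration; returns the number of entries copied. */
static size_t mock_tlb_get_all(const struct mock_tlb *t,
                               struct mock_tlbe *out, size_t max)
{
    size_t cursor = 0, n = 0;

    while (n < max && mock_tlb_next(t, &cursor, &out[n]) >= 0)
        n++;
    return n;
}
```

A reset would call mock_tlb_invalidate_all() and then mock_tlb_set() for
each initial mapping, so the cost scales with the number of entries
actually set rather than with the TLB size.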

If we do decide to mandate a standard geometry TLB, just with
settable size, then doing a full get/set has a simplicity advantage.
The iteration approach was intended to preserve flexibility of
implementation.  And then for optimization, we could add an interface
to get/set a single entry, that doesn't need to support iteration.

-Scott
