On 12.02.2011, at 01:57, Scott Wood wrote:

> On Fri, 11 Feb 2011 22:07:11 +0100
> Alexander Graf <agraf@xxxxxxx> wrote:
> 
>> 
>> On 11.02.2011, at 21:53, Scott Wood wrote:
>> 
>>> On Fri, 11 Feb 2011 02:41:35 +0100
>>> Alexander Graf <agraf@xxxxxxx> wrote:
>>> 
>>>>>> Maybe we should go with Avi's proposal after all and simply keep
>>>>>> the full soft-mmu synced between kernel and user space? That way
>>>>>> we only need a setup call at first, no copying in between and
>>>>>> simply update the user space version whenever something changes
>>>>>> in the guest. We need to store the TLB's contents off somewhere
>>>>>> anyways, so all we need is an additional in-kernel array with
>>>>>> internal translation data, but that can be separate from the
>>>>>> guest visible data, right?
>>> 
>>> Hmm, the idea is growing on me.
>>> 
>>>> So then everything we need to get the full functionality is a hint
>>>> from kernel to user space that something changed and vice versa.
>>>> 
>>>> From kernel to user space is simple. We can just document that
>>>> after every RUN, all fields may be modified.
>>>> From user space to kernel, we could modify the entries directly
>>>> and then call an ioctl that passes a dirty bitmap to kernel space.
>>>> KVM can then decide what to do with it. I guess the easiest
>>>> implementation for now would be to ignore the bitmap and simply
>>>> flush the shadow tlb.
>>>> 
>>>> That gives us the flush almost for free. All we need to do is set
>>>> the tlb to all zeros (should be done by env init anyways) and pass
>>>> in the "something changed" call. KVM can then decide to simply
>>>> drop all of its shadow state or loop through every shadow entry
>>>> and flush it individually. Maybe we should give a hint on the
>>>> number of flushes, so KVM can implement some threshold.
>>> 
>>> OK. We'll also need a config ioctl to specify MMU type/size and the
>>> address of the arrays.
>> 
>> Right, a setup call basically :).
> 
> OK, here goes v3:
> 
> [Note: one final thought after writing this up -- this isn't going to
> work too well in cases where the guest can directly manipulate its
> TLB, such as with the LRAT feature of Power Arch 2.06. We'll still
> need a copy-in/copy-out mechanism for that.]

In that case qemu sets the mmu mode accordingly and is aware of the
fact that it needs get/set methods. We'll get to that when we have
LRAT implemented :).

> struct kvmppc_book3e_tlb_entry {
> 	union {
> 		__u64 mas8_1;
> 		struct {
> 			__u32 mas8;
> 			__u32 mas1;
> 		};
> 	};
> 	__u64 mas2;
> 	union {
> 		__u64 mas7_3;
> 		struct {
> 			__u32 mas7;
> 			__u32 mas3;
> 		};
> 	};
> };

Looks good to me, except for the anonymous unions and structs of
course. Avi dislikes those :). Is there any obvious reason we need
unions in the first place? The compiler should be clever enough to
pick the right size accessors when writing/reading masked __u64
entries, no?

The struct name should also have a version indicator - it's the data
descriptor for only a single specific mmu_type, right?
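To illustrate - an untested sketch of what I have in mind, with plain
__u64 fields plus shift/mask helpers instead of the unions (helper and
struct names made up; the "_v1" is the version indicator mentioned
above):

/* Untested sketch: union-free layout with explicit accessors.
 * Assumes MAS8/MAS7 sit in the upper 32 bits of their __u64, which is
 * the layout the anonymous-struct overlay gives on big-endian hosts. */
struct kvmppc_book3e_tlb_entry_v1 {
	__u64 mas8_1;	/* MAS8 in bits 63:32, MAS1 in bits 31:0 */
	__u64 mas2;
	__u64 mas7_3;	/* MAS7 in bits 63:32, MAS3 in bits 31:0 */
};

static inline __u32 tlbe_mas3(const struct kvmppc_book3e_tlb_entry_v1 *e)
{
	return (__u32)e->mas7_3;		/* low half */
}

static inline __u32 tlbe_mas7(const struct kvmppc_book3e_tlb_entry_v1 *e)
{
	return (__u32)(e->mas7_3 >> 32);	/* high half */
}

static inline void tlbe_set_mas7_3(struct kvmppc_book3e_tlb_entry_v1 *e,
				   __u32 mas7, __u32 mas3)
{
	e->mas7_3 = ((__u64)mas7 << 32) | mas3;
}

The compiler should be able to turn these into single 32-bit accesses
where it matters.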
> For an MMU type of KVM_MMU_PPC_BOOK3E_NOHV, the mas8 field in
> kvmppc_book3e_tlb_entry is present but not supported.
> 
> struct kvmppc_book3e_tlb_params {
> 	/*
> 	 * book3e defines 4 TLBs. Individual implementations may have
> 	 * fewer. TLBs that do not already exist on the target must be
> 	 * configured with a size of zero. A tlb_ways value of zero
> 	 * means the array is fully associative. Only TLBs that are
> 	 * already set-associative on the target may be configured with
> 	 * a different associativity. A set-associative TLB may not
> 	 * exceed 255 ways.
> 	 *
> 	 * KVM will adjust TLBnCFG based on the sizes configured here,
> 	 * though arrays greater than 2048 entries will have
> 	 * TLBnCFG[NENTRY] set to zero.
> 	 *
> 	 * The size of any TLB that is set-associative must be a
> 	 * multiple of the number of ways, and the number of sets must
> 	 * be a power of two.
> 	 *
> 	 * The page sizes supported by a TLB shall be determined by
> 	 * reading the TLB configuration registers. This is not
> 	 * adjustable by userspace.
> 	 * [Note: need sregs]
> 	 */
> 	__u32 tlb_sizes[4];
> 	__u8 tlb_ways[4];
> 	__u32 reserved[11];
> };
> 
> KVM_CONFIG_TLB
> --------------
> 
> Capability: KVM_CAP_SW_TLB
> Type: vcpu ioctl
> Parameters: struct kvm_config_tlb (in)
> Returns: 0 on success
>          -1 on error
> 
> struct kvm_config_tlb {
> 	__u64 params;
> 	__u64 array;
> 	__u32 mmu_type;
> 	__u32 array_len;

Some reserved bits please. IIRC Avi also really likes it when, in
addition to reserved fields, there's a "features" int that indicates
which reserved fields are actually usable. Shouldn't hurt here either,
right?

> };
> 
> Configures the virtual CPU's TLB array, establishing a shared memory
> area between userspace and KVM. The "params" and "array" fields are
> userspace addresses of mmu-type-specific data structures. The
> "array_len" field is a safety mechanism, and should be set to the
> size in bytes of the memory that userspace has reserved for the
> array. It must be at least the size dictated by "mmu_type" and
> "params".
> 
> On return from this call, KVM assumes the state of the TLBs to be
> empty. Prior to calling KVM_RUN, userspace must call KVM_DIRTY_TLB
> to tell KVM about any valid entries.
> 
> While KVM_RUN is active, the shared region is under control of KVM.
> Its contents are undefined, and any modification by userspace
> results in boundedly undefined behavior.
> 
> On return from KVM_RUN, the shared region will reflect the current
> state of the guest's TLB. If userspace makes any changes, it must
> call KVM_DIRTY_TLB to tell KVM which entries have been changed,
> prior to calling KVM_RUN again on this vcpu.
> 
> For mmu types KVM_MMU_PPC_BOOK3E_NOHV and KVM_MMU_PPC_BOOK3E_HV:
>  - The "params" field is of type "struct kvmppc_book3e_tlb_params".
>  - The "array" field points to an array of type "struct
>    kvmppc_book3e_tlb_entry".
>  - The array consists of all entries in the first TLB, followed by
>    all entries in the second TLB, etc.
>  - Within a TLB, if the array is not set-associative, entries are
>    ordered by increasing ESEL. If the array is set-associative,
>    entries are ordered first by set. Within a set, entries are
>    ordered by way (ESEL).
>  - The hash for determining set number is:
>      (MAS2[EPN] / page_size) & (num_sets - 1)
>    where "page_size" is the smallest page size supported by the TLB,
>    and "num_sets" is the tlb_sizes[] value divided by the tlb_ways[]
>    value. If a book3e chip is made for which a different hash is
>    needed, a new MMU type must be used, to ensure that userspace and
>    KVM agree on layout.

Please state the page size explicitly then. It's 1k, right?
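Just so we're sure to agree on the layout, this is how I read the
indexing scheme (untested sketch; the helper name and parameters are
made up, with MAS2[EPN] taken in place as a masked effective address):

/* Find the first of the "ways" candidate entries for an effective
 * address within one set-associative TLB of the shared array.
 * "tlb_base" is the index of this TLB's first entry in the array,
 * "min_page_size" the smallest page size the TLB supports (1k?). */
static struct kvmppc_book3e_tlb_entry *
tlb_set_start(struct kvmppc_book3e_tlb_entry *array, __u32 tlb_base,
	      __u64 epn, __u32 size, __u32 ways, __u32 min_page_size)
{
	__u32 num_sets = size / ways;	/* a power of two, per above */
	__u32 set = (epn / min_page_size) & (num_sets - 1);

	/* Entries are ordered first by set, then by way (ESEL), so a
	 * set's entries are contiguous in the shared array. */
	return &array[tlb_base + set * ways];
}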
> KVM_DIRTY_TLB
> -------------
> 
> Capability: KVM_CAP_SW_TLB
> Type: vcpu ioctl
> Parameters: struct kvm_dirty_tlb (in)
> Returns: 0 on success
>          -1 on error
> 
> struct kvm_dirty_tlb {
> 	__u64 bitmap;
> 	__u32 num_dirty;
> };
> 
> This must be called whenever userspace has changed an entry in the
> shared TLB, prior to calling KVM_RUN on the associated vcpu.
> 
> The "bitmap" field is the userspace address of an array. This array
> consists of a number of bits equal to the total number of TLB
> entries, as determined by the last successful call to
> KVM_CONFIG_TLB, rounded up to the nearest multiple of 64.
> 
> Each bit corresponds to one TLB entry, ordered the same as in the
> shared TLB array.
> 
> The array is little-endian: bit 0 is the least significant bit of
> the first byte, bit 8 is the least significant bit of the second
> byte, etc. This avoids any complications with differing word sizes.
> 
> The "num_dirty" field is a performance hint for KVM to determine
> whether it should skip processing the bitmap and just invalidate
> everything. It must be set to the number of set bits in the bitmap.

Sounds very nice :). I'm not sure we need extensibility for the dirty
mechanism. It doesn't really hurt to add it, but why do so when it's
not required? Hrm. I'll leave that up to you :).


Alex
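PS: For completeness, here's roughly how I'd picture qemu driving the
two calls - an untested sketch assuming the proposed structs and ioctl
names above, a single fully associative 64-entry TLB, and no error
handling:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define TLB_ENTRIES 64

static struct kvmppc_book3e_tlb_entry tlb_array[TLB_ENTRIES];
static struct kvmppc_book3e_tlb_params tlb_params;

static int setup_swtlb(int vcpu_fd)
{
	struct kvm_config_tlb cfg = {
		.params    = (uintptr_t)&tlb_params,
		.array     = (uintptr_t)tlb_array,
		.mmu_type  = KVM_MMU_PPC_BOOK3E_NOHV,
		.array_len = sizeof(tlb_array),
	};

	/* KVM assumes empty TLBs after this call, so start from a
	 * zeroed array. */
	memset(tlb_array, 0, sizeof(tlb_array));
	memset(&tlb_params, 0, sizeof(tlb_params));
	tlb_params.tlb_sizes[0] = TLB_ENTRIES;
	tlb_params.tlb_ways[0] = 0;	/* fully associative */

	return ioctl(vcpu_fd, KVM_CONFIG_TLB, &cfg);
}

/* After modifying entry "esel" in tlb_array, tell KVM about it.
 * The bitmap is a byte array so that bit 0 is the LSB of the first
 * byte regardless of host endianness. */
static int dirty_one_entry(int vcpu_fd, unsigned int esel)
{
	__u8 bitmap[TLB_ENTRIES / 8] = { 0 };
	struct kvm_dirty_tlb dirty = {
		.bitmap    = (uintptr_t)bitmap,
		.num_dirty = 1,
	};

	bitmap[esel / 8] |= 1 << (esel % 8);

	return ioctl(vcpu_fd, KVM_DIRTY_TLB, &dirty);
}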