On 10/31/2011 10:12 PM, Scott Wood wrote:
> >> +4.59 KVM_DIRTY_TLB
> >> +
> >> +Capability: KVM_CAP_SW_TLB
> >> +Architectures: ppc
> >> +Type: vcpu ioctl
> >> +Parameters: struct kvm_dirty_tlb (in)
> >> +Returns: 0 on success, -1 on error
> >> +
> >> +struct kvm_dirty_tlb {
> >> +	__u64 bitmap;
> >> +	__u32 num_dirty;
> >> +};
> >
> > This is not 32/64 bit safe.  e500 is 32-bit only, yes?
>
> e5500 is 64-bit -- we don't support it with KVM yet, but it's planned.
>
> > but what if someone wants to emulate an e500 on a ppc64?  maybe it's better to add
> > padding here.
>
> What is unsafe about it?  Are you picturing TLBs with more than 4
> billion entries?

sizeof(struct kvm_dirty_tlb) == 12 for 32-bit userspace, but == 16 for
64-bit userspace and the kernel.  ABI structures must have the same
alignment and size for 32-bit and 64-bit userspace, or they need compat
handling.

> There shouldn't be any alignment issues.
>
> > Another alternative is to drop the num_dirty field (and let the kernel
> > compute it instead, shouldn't take long?), and have the third argument
> > to ioctl() reference the bitmap directly.
>
> The idea was to make it possible for the kernel to apply a threshold
> above which it would be better to ignore the bitmap entirely and flush
> everything:
>
> http://www.spinics.net/lists/kvm/msg50079.html
>
> Currently we always just flush everything, and QEMU always says
> everything is dirty when it makes a change, but the API is there if needed.

Right, but you don't need num_dirty for that.  There are typically only a
few dozen entries, yes?  It should take a trivial amount of time to
calculate the bitmap's weight.

> >> +Configures the virtual CPU's TLB array, establishing a shared memory area
> >> +between userspace and KVM.  The "params" and "array" fields are userspace
> >> +addresses of mmu-type-specific data structures.  The "array_len" field is a
> >> +safety mechanism, and should be set to the size in bytes of the memory that
> >> +userspace has reserved for the array.  It must be at least the size dictated
> >> +by "mmu_type" and "params".
> >> +
> >> +While KVM_RUN is active, the shared region is under control of KVM.  Its
> >> +contents are undefined, and any modification by userspace results in
> >> +boundedly undefined behavior.
> >> +
> >> +On return from KVM_RUN, the shared region will reflect the current state of
> >> +the guest's TLB.  If userspace makes any changes, it must call KVM_DIRTY_TLB
> >> +to tell KVM which entries have been changed, prior to calling KVM_RUN again
> >> +on this vcpu.
> >
> > We already have another mechanism for such shared memory,
> > mmap(vcpu_fd).  x86 uses it for the coalesced mmio region as well as the
> > traditional kvm_run area.  Please consider using it.
>
> What does it buy us, other than needing a separate codepath in QEMU to
> allocate the memory differently based on whether KVM (and this feature)
> are being used, since QEMU uses this for its own MMU representation?

The ability to use get_free_pages() and ordinary kernel memory directly,
instead of indirection through a struct page ** array.

> This API has been discussed extensively, and the code using it is
> already in mainline QEMU.  This aspect of it hasn't changed since the
> discussion back in February:
>
> http://www.spinics.net/lists/kvm/msg50102.html
>
> I'd prefer to avoid another round of major overhaul without a really
> good reason.

Me too, but I also prefer not to make ABI choices by inertia.  ABI is
practically the only thing I care about wrt non-x86 (other than
whitespace, of course).  Please involve me in the discussions earlier in
the future.
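
For concreteness, the sizing issue raised earlier in this message can be
shown with a small, self-contained C sketch.  The struct names and the
"pad" field below are invented for illustration; only the bitmap and
num_dirty fields come from the posted patch, and this is not a claim about
what layout was or should be adopted.

#include <stdint.h>
#include <stdio.h>

/*
 * The layout as posted.  On ABIs that align 64-bit types to 4 bytes
 * (32-bit x86, for example) this is 12 bytes; on 64-bit ABIs the
 * trailing hole pads it to 16 -- the mismatch noted above.
 */
struct kvm_dirty_tlb_as_posted {
	uint64_t bitmap;	/* userspace address of the dirty bitmap */
	uint32_t num_dirty;	/* number of bits set in the bitmap */
};

/*
 * One way to follow the "add padding" suggestion: make the tail
 * explicit so the structure is 16 bytes on every ABI and needs no
 * compat handling.
 */
struct kvm_dirty_tlb_padded {
	uint64_t bitmap;
	uint32_t num_dirty;
	uint32_t pad;		/* explicit; keeps sizeof() == 16 everywhere */
};

_Static_assert(sizeof(struct kvm_dirty_tlb_padded) == 16,
	       "padded layout must be 16 bytes regardless of ABI");

int main(void)
{
	printf("as posted: %zu bytes, padded: %zu bytes\n",
	       sizeof(struct kvm_dirty_tlb_as_posted),
	       sizeof(struct kvm_dirty_tlb_padded));
	return 0;
}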
> >> +For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
> >> + - The "params" field is of type "struct kvm_book3e_206_tlb_params".
> >> + - The "array" field points to an array of type "struct
> >> +   kvm_book3e_206_tlb_entry".
> >> + - The array consists of all entries in the first TLB, followed by all
> >> +   entries in the second TLB.
> >> + - Within a TLB, entries are ordered first by increasing set number.  Within a
> >> +   set, entries are ordered by way (increasing ESEL).
> >> + - The hash for determining set number in TLB0 is: (MAS2 >> 12) & (num_sets - 1)
> >> +   where "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
> >> + - The tsize field of mas1 shall be set to 4K on TLB0, even though the
> >> +   hardware ignores this value for TLB0.
> >
> > Holy shit.
>
> You were the one that first suggested we use shared data:
> http://www.spinics.net/lists/kvm/msg49802.html
>
> These are the assumptions needed to make such an interface well-defined.

Just remarking on the complexity, don't take it personally.

> >> @@ -95,6 +90,9 @@ struct kvmppc_vcpu_e500 {
> >>  	u32 tlb1cfg;
> >>  	u64 mcar;
> >>
> >> +	struct page **shared_tlb_pages;
> >> +	int num_shared_tlb_pages;
> >> +
> >
> > I missed the requirement that things be page aligned.
>
> They don't need to be, we'll ignore the data before and after the shared
> area.
>
> > If you use mmap(vcpu_fd) this becomes simpler; you can use
> > get_free_pages() and have a single pointer.  You can also use vmap() on
> > this array (but get_free_pages() is faster).
>
> We do use vmap().  This is just the bookkeeping so we know what pages to
> free later.

Ah, I missed that (and the pointer).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
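
The set/way ordering and the TLB0 hash quoted above are easy to get wrong,
so here is a hedged C sketch of the index arithmetic they imply.  The
structure, function names, and example geometry are invented for
illustration; only the ordering rules and the hash come from the quoted
documentation, and this is not code from the patch or from QEMU.

#include <stdint.h>
#include <stdio.h>

/* Per-TLB geometry, mirroring the tlb_sizes[]/tlb_ways[] params above. */
struct tlb_geometry {
	uint32_t sizes[2];	/* total entries in TLB0 and TLB1 */
	uint32_t ways[2];	/* associativity of TLB0 and TLB1 */
};

/* TLB0 set selection, per the quoted hash: (MAS2 >> 12) & (num_sets - 1). */
static uint32_t tlb0_set(uint64_t mas2, const struct tlb_geometry *geo)
{
	uint32_t num_sets = geo->sizes[0] / geo->ways[0];

	return (uint32_t)(mas2 >> 12) & (num_sets - 1);
}

/*
 * Index into the shared entry array: all of TLB0 first, then all of
 * TLB1; within a TLB, entries are grouped by set and ordered by way
 * (increasing ESEL) inside each set.
 */
static uint32_t shared_index(unsigned int tlb, uint32_t set, uint32_t way,
			     const struct tlb_geometry *geo)
{
	uint32_t base = (tlb == 0) ? 0 : geo->sizes[0];

	return base + set * geo->ways[tlb] + way;
}

int main(void)
{
	/* Example geometry only -- not a claim about any particular core. */
	struct tlb_geometry geo = { .sizes = { 512, 16 }, .ways = { 4, 16 } };
	uint64_t mas2 = 0x10002000ULL;	/* arbitrary effective-address bits */
	uint32_t set = tlb0_set(mas2, &geo);

	printf("TLB0 set %u, way 1 -> array index %u\n",
	       set, shared_index(0, set, 1, &geo));
	printf("TLB1 entry ESEL 3   -> array index %u\n",
	       shared_index(1, 0, 3, &geo));
	return 0;
}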