Re: [PATCH 04/14] KVM: PPC: e500: MMU API

Scott Wood <scottwood@xxxxxxxxxxxxx> · Mon, 31 Oct 2011 15:12:44 -0500

On 10/31/2011 08:24 AM, Avi Kivity wrote:
> On 10/31/2011 09:53 AM, Alexander Graf wrote:
>> From: Scott Wood <scottwood@xxxxxxxxxxxxx>
>>
>> This implements a shared-memory API for giving host userspace access to
>> the guest's TLB.
>>
>>
>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>> index 7945b0b..ab1136f 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -1383,6 +1383,38 @@ The following flags are defined:
>>  If datamatch flag is set, the event will be signaled only if the written value
>>  to the registered address is equal to datamatch in struct kvm_ioeventfd.
>>  
>> +4.59 KVM_DIRTY_TLB
>> +
>> +Capability: KVM_CAP_SW_TLB
>> +Architectures: ppc
>> +Type: vcpu ioctl
>> +Parameters: struct kvm_dirty_tlb (in)
>> +Returns: 0 on success, -1 on error
>> +
>> +struct kvm_dirty_tlb {
>> +	__u64 bitmap;
>> +	__u32 num_dirty;
>> +};
> 
> This is not 32/64 bit safe.  e500 is 32-bit only, yes?

e5500 is 64-bit -- we don't support it with KVM yet, but it's planned.

> But what if someone wants to emulate an e500 on a ppc64?  Maybe it's
> better to add padding here.

What is unsafe about it?  Are you picturing TLBs with more than 4
billion entries?

There shouldn't be any alignment issues.
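
To make the layout argument concrete, here is a minimal userspace sketch.
It is not part of the patch: the struct names are hypothetical mirrors of
struct kvm_dirty_tlb, and the padded variant is only the alternative being
suggested.  It checks that both fields sit at the same offsets with or
without explicit padding; only the trailing padding (and hence sizeof) can
differ on ABIs that align 64-bit integers to 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of struct kvm_dirty_tlb, using fixed-width types. */
struct dirty_tlb_layout {
	uint64_t bitmap;
	uint32_t num_dirty;
};

/* Hypothetical variant with the explicit padding suggested in review. */
struct dirty_tlb_layout_padded {
	uint64_t bitmap;
	uint32_t num_dirty;
	uint32_t pad;
};

/* Returns 1 if every shared field has the same offset in both layouts. */
static int dirty_tlb_offsets_match(void)
{
	return offsetof(struct dirty_tlb_layout, bitmap) ==
	       offsetof(struct dirty_tlb_layout_padded, bitmap) &&
	       offsetof(struct dirty_tlb_layout, num_dirty) ==
	       offsetof(struct dirty_tlb_layout_padded, num_dirty);
}
```

Since the kernel only reads the two named fields, a difference in trailing
padding would matter only if the struct size were encoded somewhere, which
this sketch does not try to model.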

> Another alternative is to drop the num_dirty field (and let the kernel
> compute it instead, shouldn't take long?), and have the third argument
> to ioctl() reference the bitmap directly.

The idea was to make it possible for the kernel to apply a threshold
above which it would be better to ignore the bitmap entirely and flush
everything:

http://www.spinics.net/lists/kvm/msg50079.html

Currently we always just flush everything, and QEMU always says
everything is dirty when it makes a change, but the API is there if needed.
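
As a rough illustration of the threshold idea (not what the kernel does
today, since it currently always flushes everything), the decision could
look something like the sketch below.  The enum, the function name, and the
percentage parameter are all made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcomes of a KVM_DIRTY_TLB call; not kernel names. */
enum tlb_sync_action {
	TLB_SYNC_NONE,		/* nothing changed */
	TLB_SYNC_SELECTIVE,	/* walk the bitmap, invalidate dirty entries */
	TLB_SYNC_FLUSH_ALL,	/* ignore the bitmap, flush everything */
};

/*
 * Pick a strategy from num_dirty alone, so the bitmap need not be scanned
 * when most entries changed.  threshold_percent is an assumed tunable.
 */
static enum tlb_sync_action choose_tlb_sync(uint32_t num_dirty,
					    uint32_t num_entries,
					    uint32_t threshold_percent)
{
	assert(threshold_percent <= 100);

	if (num_dirty == 0)
		return TLB_SYNC_NONE;

	/* 64-bit arithmetic keeps the comparison free of overflow. */
	if ((uint64_t)num_dirty * 100 >=
	    (uint64_t)num_entries * threshold_percent)
		return TLB_SYNC_FLUSH_ALL;

	return TLB_SYNC_SELECTIVE;
}
```

Having userspace pass num_dirty is what lets the kernel take the flush-all
branch without first counting bits in the bitmap.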

>>  4.62 KVM_CREATE_SPAPR_TCE
>>  
>>  Capability: KVM_CAP_SPAPR_TCE
>> @@ -1700,3 +1732,45 @@ HTAB address part of SDR1 contains an HVA instead of a GPA, as PAPR keeps the
>>  HTAB invisible to the guest.
>>  
>>  When this capability is enabled, KVM_EXIT_PAPR_HCALL can occur.
>> +
>> +6.3 KVM_CAP_SW_TLB
>> +
>> +Architectures: ppc
>> +Parameters: args[0] is the address of a struct kvm_config_tlb
>> +Returns: 0 on success; -1 on error
>> +
>> +struct kvm_config_tlb {
>> +	__u64 params;
>> +	__u64 array;
>> +	__u32 mmu_type;
>> +	__u32 array_len;
>> +};
> 
> Would it not be simpler to use args[0-3] for this, instead of yet
> another indirection?

I suppose so.  Its existence as a struct dates from when it was its own
ioctl rather than an argument to KVM_ENABLE_CAP.

>> +Configures the virtual CPU's TLB array, establishing a shared memory area
>> +between userspace and KVM.  The "params" and "array" fields are userspace
>> +addresses of mmu-type-specific data structures.  The "array_len" field is a
>> +safety mechanism, and should be set to the size in bytes of the memory that
>> +userspace has reserved for the array.  It must be at least the size dictated
>> +by "mmu_type" and "params".
>> +
>> +While KVM_RUN is active, the shared region is under control of KVM.  Its
>> +contents are undefined, and any modification by userspace results in
>> +boundedly undefined behavior.
>> +
>> +On return from KVM_RUN, the shared region will reflect the current state of
>> +the guest's TLB.  If userspace makes any changes, it must call KVM_DIRTY_TLB
>> +to tell KVM which entries have been changed, prior to calling KVM_RUN again
>> +on this vcpu.
> 
> We already have another mechanism for such shared memory,
> mmap(vcpu_fd).  x86 uses it for the coalesced mmio region as well as the
> traditional kvm_run area.  Please consider using it.

What does it buy us, other than needing a separate codepath in QEMU to
allocate the memory differently based on whether KVM (and this feature)
are being used, since QEMU uses this for its own MMU representation?

This API has been discussed extensively, and the code using it is
already in mainline QEMU.  This aspect of it hasn't changed since the
discussion back in February:

http://www.spinics.net/lists/kvm/msg50102.html

I'd prefer to avoid another round of major overhaul without a really
good reason.

>> +For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
>> + - The "params" field is of type "struct kvm_book3e_206_tlb_params".
>> + - The "array" field points to an array of type "struct
>> +   kvm_book3e_206_tlb_entry".
>> + - The array consists of all entries in the first TLB, followed by all
>> +   entries in the second TLB.
>> + - Within a TLB, entries are ordered first by increasing set number.  Within a
>> +   set, entries are ordered by way (increasing ESEL).
>> + - The hash for determining set number in TLB0 is: (MAS2 >> 12) & (num_sets - 1)
>> +   where "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
>> + - The tsize field of mas1 shall be set to 4K on TLB0, even though the
>> +   hardware ignores this value for TLB0.
> 
> Holy shit.

You were the one that first suggested we use shared data:
http://www.spinics.net/lists/kvm/msg49802.html

These are the assumptions needed to make such an interface well-defined.
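
For illustration, the ordering rules quoted above translate into index
arithmetic roughly like the following.  This is a sketch, not code from the
patch: the helper names are hypothetical, num_sets is assumed to be a power
of two (which the masking in the documented hash implies), and TLB1 is
assumed to be a single set so that its entry number is also its offset.

```c
#include <assert.h>
#include <stdint.h>

/* Set number for TLB0, per the documented hash on MAS2. */
static uint32_t tlb0_set(uint64_t mas2, uint32_t tlb0_size,
			 uint32_t tlb0_ways)
{
	uint32_t num_sets;

	assert(tlb0_ways != 0);
	num_sets = tlb0_size / tlb0_ways;
	/* Masking with (num_sets - 1) only works for a power of two. */
	assert(num_sets != 0 && (num_sets & (num_sets - 1)) == 0);

	return (uint32_t)(mas2 >> 12) & (num_sets - 1);
}

/* Index into the shared array for a TLB0 entry: set-major, then way. */
static uint32_t tlb0_index(uint64_t mas2, uint32_t way,
			   uint32_t tlb0_size, uint32_t tlb0_ways)
{
	assert(way < tlb0_ways);
	return tlb0_set(mas2, tlb0_size, tlb0_ways) * tlb0_ways + way;
}

/* TLB1 entries follow all of TLB0 (assumes TLB1 is one set). */
static uint32_t tlb1_index(uint32_t entry, uint32_t tlb0_size)
{
	return tlb0_size + entry;
}
```

With a 512-entry, 4-way TLB0 (example numbers only), an address whose
MAS2 >> 12 ends in 3 lands in set 3, so way 2 of that set is array slot 14.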

>> @@ -95,6 +90,9 @@ struct kvmppc_vcpu_e500 {
>>  	u32 tlb1cfg;
>>  	u64 mcar;
>>  
>> +	struct page **shared_tlb_pages;
>> +	int num_shared_tlb_pages;
>> +
> 
> I missed the requirement that things be page aligned.

They don't need to be; we'll ignore the data before and after the shared
area.
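
A sketch of the arithmetic involved, for illustration only (the helper
names are hypothetical and this is userspace C, not the kernel code): an
unaligned area simply spans every page it touches, and the array starts at
some offset into the first of those pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of pages touched by [addr, addr + len); page_size is a power of 2. */
static size_t pages_spanned(uintptr_t addr, size_t len, size_t page_size)
{
	uintptr_t mask, first, last;

	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	if (len == 0)
		return 0;

	mask = ~(uintptr_t)(page_size - 1);
	first = addr & mask;			/* page holding the first byte */
	last = (addr + len - 1) & mask;		/* page holding the last byte */

	return (size_t)((last - first) / page_size) + 1;
}

/* Offset of the array within the first page; bytes before it are ignored. */
static size_t offset_in_first_page(uintptr_t addr, size_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return (size_t)(addr & (uintptr_t)(page_size - 1));
}
```

For example, a 32-byte area starting 16 bytes before a page boundary needs
two pages, and only bytes at and after the in-page offset belong to the
shared region.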

> If you use mmap(vcpu_fd) this becomes simpler; you can use
> get_free_pages() and have a single pointer.  You can also use vmap() on
> this array (but get_free_pages() is faster).

We do use vmap().  This is just the bookkeeping so we know what pages to
free later.

-Scott
