On 06/17/2013 06:02 PM, Alexander Graf wrote: > > On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote: > >> On 06/17/2013 08:06 AM, Alexander Graf wrote: >>> >>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote: >>> >>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and >>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO >>>> devices or emulated PCI. These calls allow adding multiple entries >>>> (up to 512) into the TCE table in one call which saves time on >>>> transition to/from real mode. >>>> >>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs >>>> (copied from user and verified) before writing the whole list into >>>> the TCE table. This cache will be utilized more in the upcoming >>>> VFIO/IOMMU support to continue TCE list processing in the virtual >>>> mode in the case if the real mode handler failed for some reason. >>>> >>>> This adds a guest physical to host real address converter >>>> and calls the existing H_PUT_TCE handler. The converting function >>>> is going to be fully utilized by upcoming VFIO supporting patches. >>>> >>>> This also implements the KVM_CAP_PPC_MULTITCE capability, >>>> so in order to support the functionality of this patch, QEMU >>>> needs to query for this capability and set the "hcall-multi-tce" >>>> hypertas property only if the capability is present, otherwise >>>> there will be serious performance degradation. >>>> >>>> Cc: David Gibson <david@xxxxxxxxxxxxxxxxxxxxx> >>>> Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx> >>>> Signed-off-by: Paul Mackerras <paulus@xxxxxxxxx> >>> >>> Only a few minor nits. Ben already commented on implementation details. >>> >>>> >>>> --- >>>> Changelog: >>>> 2013/06/05: >>>> * fixed mistype about IBMVIO in the commit message >>>> * updated doc and moved it to another section >>>> * changed capability number >>>> >>>> 2013/05/21: >>>> * added kvm_vcpu_arch::tce_tmp >>>> * removed cleanup if put_indirect failed, instead we do not even start >>>> writing to TCE table if we cannot get TCEs from the user and they are >>>> invalid >>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce >>>> and kvmppc_emulated_validate_tce (for the previous item) >>>> * fixed bug with failthrough for H_IPI >>>> * removed all get_user() from real mode handlers >>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) >>>> --- >>>> Documentation/virtual/kvm/api.txt | 17 ++ >>>> arch/powerpc/include/asm/kvm_host.h | 2 + >>>> arch/powerpc/include/asm/kvm_ppc.h | 16 +- >>>> arch/powerpc/kvm/book3s_64_vio.c | 118 ++++++++++++++ >>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 266 +++++++++++++++++++++++++++---- >>>> arch/powerpc/kvm/book3s_hv.c | 39 +++++ >>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 + >>>> arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++- >>>> arch/powerpc/kvm/powerpc.c | 3 + >>>> include/uapi/linux/kvm.h | 1 + >>>> 10 files changed, 473 insertions(+), 32 deletions(-) >>>> >>>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt >>>> index 5f91eda..6c082ff 100644 >>>> --- a/Documentation/virtual/kvm/api.txt >>>> +++ b/Documentation/virtual/kvm/api.txt >>>> @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be >>>> handled. >>>> >>>> >>>> +4.83 KVM_CAP_PPC_MULTITCE >>>> + >>>> +Capability: KVM_CAP_PPC_MULTITCE >>>> +Architectures: ppc >>>> +Type: vm >>>> + >>>> +This capability tells the guest that multiple TCE entry add/remove hypercalls >>>> +handling is supported by the kernel. This significanly accelerates DMA >>>> +operations for PPC KVM guests. >>>> + >>>> +Unlike other capabilities in this section, this one does not have an ioctl. >>>> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and >>>> +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to >>>> +the guest. Othwerwise it might be better for the guest to continue using H_PUT_TCE >>>> +hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present). >>> >> >>> While this describes perfectly well what the consequences are of the >>> patches, it does not describe properly what the CAP actually expresses. >>> The CAP only says "this kernel is able to handle H_PUT_TCE_INDIRECT and >>> H_STUFF_TCE hypercalls directly". All other consequences are nice to >>> document, but the semantics of the CAP are missing. >> >> >> ? It expresses ability to handle 2 hcalls. What is missing? > > You don't describe the kvm <-> qemu interface. You describe some decisions qemu can take from this cap. This file does not mention qemu at all. And the interface is - qemu (or kvmtool could do that) just adds "hcall-multi-tce" to "ibm,hypertas-functions" but this is for pseries linux and AIX could always do it (no idea about it). Does it really have to be in this file? >>> We also usually try to keep KVM behavior unchanged with regards to older >>> versions until a CAP is enabled. In this case I don't think it matters >>> all that much, so I'm fine with declaring it as enabled by default. >>> Please document that this is a change in behavior versus older KVM >>> versions though. >> >> >> Ok! >> >> >>>> + >>>> + >>>> 5. The kvm_run structure >>>> ------------------------ >>>> >>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h >>>> index af326cd..85d8f26 100644 >>>> --- a/arch/powerpc/include/asm/kvm_host.h >>>> +++ b/arch/powerpc/include/asm/kvm_host.h >>>> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch { >>>> spinlock_t tbacct_lock; >>>> u64 busy_stolen; >>>> u64 busy_preempt; >>>> + >>>> + unsigned long *tce_tmp; /* TCE cache for TCE_PUT_INDIRECT hall */ >>>> #endif >>>> }; >>> >>> [...] >>>> >>>> >>> >>> [...] >>> >>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c >>>> index 550f592..a39039a 100644 >>>> --- a/arch/powerpc/kvm/book3s_hv.c >>>> +++ b/arch/powerpc/kvm/book3s_hv.c >>>> @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu) >>>> ret = kvmppc_xics_hcall(vcpu, req); >>>> break; >>>> } /* fallthrough */ >>> >>> The fallthrough comment isn't accurate anymore. >>> >>>> + return RESUME_HOST; >>>> + case H_PUT_TCE: >>>> + ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4), >>>> + kvmppc_get_gpr(vcpu, 5), >>>> + kvmppc_get_gpr(vcpu, 6)); >>>> + if (ret == H_TOO_HARD) >>>> + return RESUME_HOST; >>>> + break; >>>> + case H_PUT_TCE_INDIRECT: >>>> + ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4), >>>> + kvmppc_get_gpr(vcpu, 5), >>>> + kvmppc_get_gpr(vcpu, 6), >>>> + kvmppc_get_gpr(vcpu, 7)); >>>> + if (ret == H_TOO_HARD) >>>> + return RESUME_HOST; >>>> + break; >>>> + case H_STUFF_TCE: >>>> + ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4), >>>> + kvmppc_get_gpr(vcpu, 5), >>>> + kvmppc_get_gpr(vcpu, 6), >>>> + kvmppc_get_gpr(vcpu, 7)); >>>> + if (ret == H_TOO_HARD) >>>> + return RESUME_HOST; >>>> + break; >>>> default: >>>> return RESUME_HOST; >>>> } >>>> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id) >>>> vcpu->arch.cpu_type = KVM_CPU_3S_64; >>>> kvmppc_sanity_check(vcpu); >>>> >>>> + /* >>>> + * As we want to minimize the chance of having H_PUT_TCE_INDIRECT >>>> + * half executed, we first read TCEs from the user, check them and >>>> + * return error if something went wrong and only then put TCEs into >>>> + * the TCE table. >>>> + * >>>> + * tce_tmp is a cache for TCEs to avoid stack allocation or >>>> + * kmalloc as the whole TCE list can take up to 512 items 8 bytes >>>> + * each (4096 bytes). >>>> + */ >>>> + vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL); >>>> + if (!vcpu->arch.tce_tmp) >>>> + goto free_vcpu; >>>> + >>>> return vcpu; >>>> >>>> free_vcpu: >>>> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu) >>>> unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow); >>>> unpin_vpa(vcpu->kvm, &vcpu->arch.vpa); >>>> spin_unlock(&vcpu->arch.vpa_update_lock); >>>> + kfree(vcpu->arch.tce_tmp); >>>> kvm_vcpu_uninit(vcpu); >>>> kmem_cache_free(kvm_vcpu_cache, vcpu); >>>> } >>>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S >>>> index b02f91e..d35554e 100644 >>>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S >>>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S >>>> @@ -1490,6 +1490,12 @@ hcall_real_table: >>>> .long 0 /* 0x11c */ >>>> .long 0 /* 0x120 */ >>>> .long .kvmppc_h_bulk_remove - hcall_real_table >>>> + .long 0 /* 0x128 */ >>>> + .long 0 /* 0x12c */ >>>> + .long 0 /* 0x130 */ >>>> + .long 0 /* 0x134 */ >>>> + .long .kvmppc_h_stuff_tce - hcall_real_table >>>> + .long .kvmppc_h_put_tce_indirect - hcall_real_table >>>> hcall_real_table_end: >>>> >>>> ignore_hdec: >>>> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c >>>> index da0e0bc..91d4b45 100644 >>>> --- a/arch/powerpc/kvm/book3s_pr_papr.c >>>> +++ b/arch/powerpc/kvm/book3s_pr_papr.c >>>> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu) >>>> unsigned long tce = kvmppc_get_gpr(vcpu, 6); >>>> long rc; >>>> >>>> - rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce); >>>> + rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce); >>>> + if (rc == H_TOO_HARD) >>>> + return EMULATE_FAIL; >>>> + kvmppc_set_gpr(vcpu, 3, rc); >>>> + return EMULATE_DONE; >>>> +} >>>> + >>>> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu) >>>> +{ >>>> + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); >>>> + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); >>>> + unsigned long tce = kvmppc_get_gpr(vcpu, 6); >>>> + unsigned long npages = kvmppc_get_gpr(vcpu, 7); >>>> + long rc; >>>> + >>>> + rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba, >>>> + tce, npages); >>>> + if (rc == H_TOO_HARD) >>>> + return EMULATE_FAIL; >>>> + kvmppc_set_gpr(vcpu, 3, rc); >>>> + return EMULATE_DONE; >>>> +} >>>> + >>>> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu) >>>> +{ >>>> + unsigned long liobn = kvmppc_get_gpr(vcpu, 4); >>>> + unsigned long ioba = kvmppc_get_gpr(vcpu, 5); >>>> + unsigned long tce_value = kvmppc_get_gpr(vcpu, 6); >>>> + unsigned long npages = kvmppc_get_gpr(vcpu, 7); >>>> + long rc; >>>> + >>>> + rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages); >>>> if (rc == H_TOO_HARD) >>>> return EMULATE_FAIL; >>>> kvmppc_set_gpr(vcpu, 3, rc); >>>> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd) >>>> return kvmppc_h_pr_bulk_remove(vcpu); >>>> case H_PUT_TCE: >>>> return kvmppc_h_pr_put_tce(vcpu); >>>> + case H_PUT_TCE_INDIRECT: >>>> + return kvmppc_h_pr_put_tce_indirect(vcpu); >>>> + case H_STUFF_TCE: >>>> + return kvmppc_h_pr_stuff_tce(vcpu); >>>> case H_CEDE: >>>> vcpu->arch.shared->msr |= MSR_EE; >>>> kvm_vcpu_block(vcpu); >>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c >>>> index 6316ee3..8465c2a 100644 >>>> --- a/arch/powerpc/kvm/powerpc.c >>>> +++ b/arch/powerpc/kvm/powerpc.c >>>> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext) >>>> r = 1; >>>> break; >>>> #endif >>>> + case KVM_CAP_SPAPR_MULTITCE: >>>> + r = 1; >>> >>> This should only be true for book3s. >> >> >> We had this discussion with v2. >> >> David: >> === >> So, in the case of MULTITCE, that's not quite right. PR KVM can >> emulate a PAPR system on a BookE machine, and there's no reason not to >> allow TCE acceleration as well. We can't make it dependent on PAPR >> mode being selected, because that's enabled per-vcpu, whereas these >> capabilities are queried on the VM before the vcpus are created. >> === >> >> Wrong? > Partially. BookE can not emulate a PAPR system as it stands today. Oh. Ok. So - #ifdef CONFIG_PPC_BOOK3S_64 ? Or run-time check for book3s (how...)? -- Alexey -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html