On Fri, Feb 10, 2017 at 03:09:30PM +1100, Alexey Kardashevskiy wrote: > On 10/02/17 14:07, David Gibson wrote: > > On Thu, Feb 09, 2017 at 07:20:11PM +1100, Alexey Kardashevskiy wrote: > >> On 09/02/17 14:51, David Gibson wrote: > >>> On Tue, Feb 07, 2017 at 06:17:09PM +1100, Alexey Kardashevskiy wrote: > >>>> For the emulated devices it does not matter much if we get a broken TCE > >>>> half way handling a TCE list but for VFIO it will matter as it has > >>>> more chances to fail so we try to do our best and check as much as we > >>>> can before proceeding. > >>>> > >>>> This separates a guest view table update from validation. No change in > >>>> behavior is expected. > >>>> > >>>> Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx> > >>>> --- > >>>> arch/powerpc/kvm/book3s_64_vio.c | 8 ++++++++ > >>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++++++-- > >>>> 2 files changed, 14 insertions(+), 2 deletions(-) > >>>> > >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c > >>>> index 15df8ae627d9..9a7b7fca5e84 100644 > >>>> --- a/arch/powerpc/kvm/book3s_64_vio.c > >>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c > >>>> @@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, > >>>> ret = kvmppc_tce_validate(stt, tce); > >>>> if (ret != H_SUCCESS) > >>>> goto unlock_exit; > >>>> + } > >>>> + > >>>> + for (i = 0; i < npages; ++i) { > >>>> + if (get_user(tce, tces + i)) { > >>>> + ret = H_TOO_HARD; > >>>> + goto unlock_exit; > >>>> + } > >>>> + tce = be64_to_cpu(tce); > >>> > >>> This doesn't look safe. The contents of user memory could change > >>> between the two get_user()s, meaning that you're no longer guaranteed > >>> a TCE loaded into kernel has been validated at all. > >>> > >>> I think you need to either: > >>> > >>> a) Make sure things safe against a bad TCE being loaded into a TCE > >>> table and move all validation to where the TCE is used, rather > >>> than loaded > >>> > >>> or > >>> b) Copy the whole set of indirect entries to a temporary in-kernel > >>> buffer, then validate, then load into the actual TCE table. > >> > >> > >> Correct :( The problem is I do not know how far I want to go in reverting > >> the state as it was when I started handling H_PUT_TCE_INDIRECT. > >> > >> For example, 1 container, 2 IOMMU groups with disabled shared tables, so - > >> 2 tables, 512 TCEs request and TCE#100 does not translate to host physical > >> address. > >> > >> > >> To do a) I'll need to remember old content of each hardware table entry as > >> when I reach TCE#100, I'll need to revert to the initial state which means > >> I need to write back old TCEs to all affected hardware tables and update > >> reference counters of all affected preregistered areas. Well, the actual > >> tables must not have different addresses (BUG_ON? is it worth testing while > >> writing to hardware tables that values I am replacing are the same in all > >> tables?) so I can have just a single array of old TCEs from hardware tables > >> in vcpu. > > > > I thought you said shared tables were disabled, so the two tables > > would have different addresses? > > That would be 2 physically separated tables but the content would be the > same as long as they belong to the same VFIO container. Ok. I thought you were talking about the address of the TCE tables being the same above. Did you mean the address of an individual page mapped in the TCE table? > > Hmm. Now I'm trying to remember, will the gpa->hpa translation fail > > only if the guest/qemu does something wrong, or can it fail for other > > reasons? > > This should always just work. Ok, given that, just replacing HPAs we can't translate with a clear entry seems fine to me. > > What about in real mode vs. virtual mode? > > Real mode is no different in this matter. > > Real mode is different from virtual mode in 3 aspects: > > 1. iommu_table_ops::exchange() vs. exchange_rm() as real mode uses cache > inhibited writes to invalidate "TCE kill" cache; > > 2. list_for_each_entry_lockless() vs. list_for_each_entry_rct() because of > lockdep does not work in real mode properly; > > 3. real mode uses vmalloc_to_phys() while virtual mode can access vmalloc'd > addresses directly. Not expected to fail. > > This is a full list. Ok. > > I think the key to this approach will be to think carefully about what > > semantics you guarantee for mappings shadowed into the hardware > > tables. For example, it might work to specify that the host mappings > > only match the GPA mappings if those GPA mapings are valid in the > > first place. So, H_PUT_TCE etc. would succeed as long as they're able > > to update the view of the table in terms of GPA. But when you shadow > > those into the HPA tables, any entries which can't be translated you > > just replace with a cleared entry. > > Literally with zero? Silently? WARN_ON_ONCE? Well, with a no-permission TCE, which might as well be zero, yes. WARN_ON_ONCE() is probably a good idea. > > That should be enough to protect > > the host. Obviously you can expect the device to fail when you > > actually attempt to DMA there, but that's the guest's (or qemu's) own > > fault for putting bad addresses in the TCE table. > > > > Obviously that might not be great for debugging, since mappings will > > appear to succeed, but then not work later on. > > > > This does have the nice property that it's reasonably obvious what to > > do if you have some GPA mappings for emulated devices, then hotplug a > > VFIO device and at that point hit a gpa->hpa translation error. > > There's no hcall in this case, so there's no obvious way to return an > > error to the guest. > > Right. So if I do this, you would probably even ack this? :) Assuming I don't spot some other showstopper... Oh.. one thing to make sure you think about though: what happens if a guest makes some mappings, then there's a memory hotplug event which changes the set of valid GPAs? In particular what if you hot unplug some memory which is mapped in a guest TCE table? You might have to regenerate the HPA tables from the GPA table on hot unplug (unless you have a way of locking out an unplug event while that piece of guest ram is TCE mapped). > >> To do b) I'll need: > >> > >> 1. to have a copy of TCEs from the guest in vcpu, > > > > I don't quite understand this. You need a temporary copy, yes, but I > > don't see why it needs to be attached to the vcpu. > > It does not need, I just need a safe + static + lock-free place for it as I > do not want to do malloc() in the TCE handlers and (in theory) multiple > CPUs can do concurrent TCE requests and I want to avoid locking especially > in realmode. Ah, right, it's the inability to malloc() that's the difficulty. You could put it in the vcpu, or you could use a per-(host)-cpu area - you can't switch guests while in a realmode handler. > > > >> I populate it via > >> get_user() to make sure they won't change; > >> 2. an array of userspace addresses translated from given TCEs; and in order > >> to make sure these addresses won't go away, I'll need to reference each > >> preregistered memory area via mm_iommu_mapped_inc(). > >> > >> When I reach TCE#100, I'll have to revert the change, i.e. call > >> mm_iommu_mapped_dec(). > > > > Ugh.. yeah, I think to do this sanely, what you'd have to do is copy > > the updated translations into a temp buffer. Then you'd to make more > > temp buffers to store the UA and HPA translations (although maybe you > > could overwrite/reuse the original temp buffer if you're careful). > > Then only if all of those succeed do you copy them into the real > > hardware tables. > > > > Which sounds like it might be kinda messy, at least in real mode. > > So is it worth it? Option (a) is certainly looking better to me based on current information. > >> So I will end up having 2 arrays in a vcpu and simpler reverting code. > >> > >> > >> Or I can do simpler version of b) which would store guest TCEs in > >> kvm_vcpu_arch::tces[512] and use them after checking. If a malicious guest > >> does something bad and I return from H_PUT_TCE_INDIRECT in a middle of > >> request, some preregistered regions will stay referenced till the guest is > >> killed or rebooted (and this will prevent memory from unregistering) - but > >> this means no harm to the host; > > > > Hrm.. that's not really true. It's not the worst thing that can > > happen, but allowing the guest to permanently lock extra chunks of > > memory is a form of harm to the host. > > > These are the same preregistered chunks which are already locked. And the > lock is there till QEMU process is dead. What will not be possible is > memory hotunplug. Ah, ok, I see your point. That's probably sufficient, but option (a) is still looking better. > >> and with preregistered RAM, there is no > >> valid reason for H_PUT_TCE_INDIRECT to fail for a good guest. > >> > >> > >> > >> Which approach to pick? > >> > >> > >> LoPAPR says: > >> === > >> If the TCE parameter represents the logical page address of a page that is > >> not valid for the calling partition, return > >> H_Parameter. > >> === > >> > >> > >> > >>>> > >>>> kvmppc_tce_put(stt, entry + i, tce); > >>>> } > >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c > >>>> index 918af76ab2b6..f8a54b7c788e 100644 > >>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c > >>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c > >>>> @@ -237,7 +237,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu, > >>>> { > >>>> struct kvmppc_spapr_tce_table *stt; > >>>> long i, ret = H_SUCCESS; > >>>> - unsigned long tces, entry, ua = 0; > >>>> + unsigned long tces, entry, tce, ua = 0; > >>>> unsigned long *rmap = NULL; > >>>> > >>>> stt = kvmppc_find_table(vcpu->kvm, liobn); > >>>> @@ -279,11 +279,15 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu, > >>>> } > >>>> > >>>> for (i = 0; i < npages; ++i) { > >>>> - unsigned long tce = be64_to_cpu(((u64 *)tces)[i]); > >>>> + tce = be64_to_cpu(((u64 *)tces)[i]); > >>>> > >>>> ret = kvmppc_tce_validate(stt, tce); > >>>> if (ret != H_SUCCESS) > >>>> goto unlock_exit; > >>>> + } > >>>> + > >>>> + for (i = 0; i < npages; ++i) { > >>>> + tce = be64_to_cpu(((u64 *)tces)[i]); > >>> > >>> Same problem here. > >>> > >>>> > >>>> kvmppc_tce_put(stt, entry + i, tce); > >>>> } > >>> > >> > >> > > > > > > > > > > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Attachment:
signature.asc
Description: PGP signature