On 18/12/2019 08:06, Leonardo Bras wrote:
> Fixes a bug that happens when a virtual machine is created without DDW,
> with vhost supporting a virtio-net device.
>
> In this scenario, an IOMMU with a 32-bit DMA window may map IOVAs to
> different memory addresses over time.
>
> As the code works today, the H_STUFF_TCE hypercall is handled only by
> kvm code, which does not invalidate the IOTLB entry in vhost, meaning
> that at some point an old entry can cause an access to the memory
> address the IOVA previously pointed to.
>
> Example:
> - virtio-net passes IOVA N to vhost, which points to M1
> - vhost tries the IOTLB, but misses
> - vhost translates IOVA N and stores the result in the IOTLB
> - vhost writes to M1
> - (some IOMMU usage)
> - virtio-net passes IOVA N to vhost, which now points to M2
> - vhost tries the IOTLB, and translates IOVA N to M1
> - vhost writes to M1 <error, should write to M2>
>
> The reason this error was not so evident is probably that the IOTLB
> was small enough to almost always miss by the time an IOVA was reused.
> Raising the IOTLB size to 32k (a module parameter that defaults to 2k)
> is enough to reproduce the bug in over 90% of the runs. It usually
> takes less than 10 seconds of netperf to trigger the bug.
>
> A few minutes after reproducing this bug, the guest usually crashes.
>
> Fixing this bug involves removing an IOVA entry from the IOTLB. The
> guest kernel triggers this by doing an H_STUFF_TCE hypercall with
> tce_value == 0.
>
> This change fixes the bug by returning H_TOO_HARD from
> kvmppc_h_stuff_tce when tce_value == 0, which causes kvm to let qemu
> deal with the hypercall. In that case, qemu does free the vhost IOTLB
> entry, which fixes the bug.
>
> Signed-off-by: Leonardo Bras <leonardo@xxxxxxxxxxxxx>
> ---
>  arch/powerpc/kvm/book3s_64_vio.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 883a66e76638..841eff3f6392 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -710,6 +710,9 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (ret != H_SUCCESS)
>  		return ret;
>
> +	if (tce_value == 0)

H_STUFF_TCE is always called with 0. Well, maybe some AIX somewhere
calls it with a value other than zero (and I probably saw some other
value somewhere), but in the QEMU/KVM case it is 0, so this effectively
disables in-kernel acceleration of H_STUFF_TCE, which is undesirable.

For now we should disable the in-kernel H_STUFF_TCE/... handlers in
QEMU, just like we do for VFIO on older host kernels:

https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_iommu.c;h=3d3bcc86496a5277d62f7855fbb09c013c015f27;hb=HEAD#l208

I am not sure what a proper solution would be; perhaps something like an
eventfd, with KVM's kvmppc_h_stuff_tce() signalling vhost that the
latter needs to invalidate its IOTLBs. Or we can just say that we do not
allow KVM acceleration if there is vhost+iommu on the same liobn
(== vPHB, pretty much).

Thanks,

> +		return H_TOO_HARD;
> +
>  	/* Check permission bits only to allow userspace poison TCE for debug */
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
> --

Alexey
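
To make the stale-translation sequence above concrete, here is a
minimal, self-contained toy model of a translation cache. This is a
sketch only; none of these names come from vhost. Without an
invalidation when IOVA N is remapped, the second lookup still returns
the old target M1:

/* Toy model of the stale-IOTLB scenario described in the patch.
 * Illustrative sketch only, not vhost code; all names are made up. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define IOTLB_SLOTS 16

struct iotlb_entry {
	uint64_t iova;
	uint64_t host_addr;
	bool valid;
};

static struct iotlb_entry iotlb[IOTLB_SLOTS];

static bool iotlb_lookup(uint64_t iova, uint64_t *host_addr)
{
	struct iotlb_entry *e = &iotlb[iova % IOTLB_SLOTS];

	if (e->valid && e->iova == iova) {
		*host_addr = e->host_addr;
		return true;
	}
	return false;
}

static void iotlb_insert(uint64_t iova, uint64_t host_addr)
{
	struct iotlb_entry *e = &iotlb[iova % IOTLB_SLOTS];

	e->iova = iova;
	e->host_addr = host_addr;
	e->valid = true;
}

/* The effect H_STUFF_TCE with tce_value == 0 should have on the cache. */
static void iotlb_invalidate(uint64_t iova)
{
	iotlb[iova % IOTLB_SLOTS].valid = false;
}

int main(void)
{
	uint64_t n = 0x1000, m1 = 0xaaaa0000, m2 = 0xbbbb0000, dest;
	bool invalidate_on_remap = false;	/* set true to model the fix */

	/* First access: miss, walk the "IOMMU table" (N -> M1), cache it. */
	if (!iotlb_lookup(n, &dest))
		iotlb_insert(n, dest = m1);
	printf("write to %#lx (expect M1)\n", (unsigned long)dest);

	/* Guest remaps IOVA N to M2; without the fix, nothing tells the cache. */
	if (invalidate_on_remap)
		iotlb_invalidate(n);

	/* Second access: stale hit returns M1 although N now maps to M2. */
	if (!iotlb_lookup(n, &dest))
		iotlb_insert(n, dest = m2);
	printf("write to %#lx (bug: still M1, should be M2)\n",
	       (unsigned long)dest);
	return 0;
}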
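
And a rough sketch of the QEMU-side gating suggested above, modeled
loosely on how need_vfio disables the in-kernel TCE handlers in
hw/ppc/spapr_iommu.c. Every name here (need_vhost_iommu,
spapr_tce_want_kvm_accel, spapr_tce_table_sketch) is hypothetical, not
actual QEMU API:

/* Sketch only: mirrors the shape of the need_vfio gating in QEMU's
 * hw/ppc/spapr_iommu.c; the symbols below are invented for illustration. */
#include <stdbool.h>
#include <stdio.h>

struct spapr_tce_table_sketch {
	bool need_vfio;		/* existing case: VFIO must see TCE updates */
	bool need_vhost_iommu;	/* hypothetical: vhost IOTLB must see H_STUFF_TCE */
};

/* Use the in-kernel H_PUT_TCE/H_STUFF_TCE fast path only when no
 * userspace component depends on observing TCE updates for this liobn. */
static bool spapr_tce_want_kvm_accel(const struct spapr_tce_table_sketch *t)
{
	return !t->need_vfio && !t->need_vhost_iommu;
}

int main(void)
{
	struct spapr_tce_table_sketch t = {
		.need_vfio = false,
		.need_vhost_iommu = true,	/* vhost + IOMMU on this liobn */
	};

	printf("in-kernel TCE acceleration: %s\n",
	       spapr_tce_want_kvm_accel(&t) ? "enabled" : "disabled");
	return 0;
}

The idea being that "does vhost need to observe H_STUFF_TCE" is a
per-liobn property, so it could be checked at the same place the VFIO
case already is.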