On 04/28/2016 10:43 PM, Michael S. Tsirkin wrote:
> On Thu, Apr 28, 2016 at 02:37:16PM +0800, Jason Wang wrote:
>>
>> On 04/27/2016 07:45 PM, Michael S. Tsirkin wrote:
>>> On Fri, Mar 25, 2016 at 10:34:34AM +0800, Jason Wang wrote:
>>>> This patch tries to implement an device IOTLB for vhost. This could be
>>>> used with for co-operation with userspace(qemu) implementation of DMA
>>>> remapping.
>>>>
>>>> The idea is simple. When vhost meets an IOTLB miss, it will request
>>>> the assistance of userspace to do the translation, this is done
>>>> through:
>>>>
>>>> - Fill the translation request in a preset userspace address (This
>>>>   address is set through ioctl VHOST_SET_IOTLB_REQUEST_ENTRY).
>>>> - Notify userspace through eventfd (This eventfd was set through ioctl
>>>>   VHOST_SET_IOTLB_FD).
>>> Why use an eventfd for this?
>> The aim is to implement the API all through ioctls.
>>
>>> We use them for interrupts because
>>> that happens to be what kvm wants, but here - why don't we
>>> just add a generic support for reading out events
>>> on the vhost fd itself?
>> I've considered this approach, but what's the advantages of this? I mean
>> looks like all other ioctls could be done through vhost fd
>> reading/writing too.
> read/write have a non-blocking flag.
>
> It's not useful for other ioctls but it's useful here.
>

Ok, this looks better.

>>>> - device IOTLB were started and stopped through VHOST_RUN_IOTLB ioctl
>>>>
>>>> When userspace finishes the translation, it will update the vhost
>>>> IOTLB through VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge of
>>>> snooping the IOTLB invalidation of IOMMU IOTLB and use
>>>> VHOST_UPDATE_IOTLB to invalidate the possible entry in vhost.
>>> There's one problem here, and that is that VQs still do not undergo
>>> translation. In theory VQ could be mapped in such a way
>>> that it's not contigious in userspace memory.
>> I'm not sure I get the issue, current vhost API support setting
>> desc_user_addr, used_user_addr and avail_user_addr independently. So
>> looks ok? If not, looks not a problem to device IOTLB API itself.
> The problem is that addresses are all HVA.
>
> Without an iommu, we ask for them to be contigious and
> since bus address == GPA, this means contigious GPA =>
> contigious HVA. With an IOMMU you can map contigious
> bus address but non contigious GPA and non contigious HVA.

Yes, so the issue is that we should not reuse VHOST_SET_VRING_ADDR but
instead invent a new ioctl to set the bus address (guest iova), and then
access the VQ through the device IOTLB too.

>
> Another concern: what if guest changes the GPA while keeping bus address
> constant? Normal devices will work because they only use
> bus addresses, but virtio will break.

If we access the VQ through the device IOTLB too, this could be solved.

>
>
>
>>>
>>>> Signed-off-by: Jason Wang <jasowang@xxxxxxxxxx>
>>> What limits amount of entries that kernel keeps around?
>> It depends on guest working set I think. Looking at
>> http://dpdk.org/doc/guides/linux_gsg/sys_reqs.html:
>>
>> - For 2MB page size in guest, it suggests hugepages=1024
>> - For 1GB page size, it suggests a hugepages=4
>>
>> So I choose 2048 to make sure it can cover this.
> 4K page size is rather common, too.

I assume hugepages are widely used here, and there's a note in the above
link: "For 64-bit applications, it is recommended to use 1 GB hugepages
if the platform supports them."

For the 4K case, the TLB hit rate will be very low for a large working
set even in a physical environment. Not sure we should care; if we want,
we can probably cache more translations in the userspace device IOTLB
implementation.
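To make that concrete, below is a rough sketch of the userspace side I
have in mind. Only struct vhost_iotlb_entry and the ioctls come from
this patch; iommu_translate() and the rest of the names are made-up
placeholders for the qemu-side IOMMU emulation, not real code:

/*
 * Sketch only: wait for an IOTLB miss, translate it with a
 * (hypothetical) userspace IOTLB/IOMMU lookup, then push the result
 * back into vhost.  Registration of iotlb_request through
 * VHOST_SET_VRING_IOTLB_REQUEST and error reporting are omitted.
 */
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>	/* with this patch applied */

/* Buffer registered earlier with VHOST_SET_VRING_IOTLB_REQUEST;
 * vhost fills it with the miss before signalling the eventfd. */
static struct vhost_iotlb_entry iotlb_request;

/* Placeholder for the emulated IOMMU / userspace IOTLB cache, which
 * can hold far more translations than the entries vhost keeps. */
extern int iommu_translate(uint64_t iova, uint64_t *uaddr,
			   uint64_t *size, uint8_t *perm);

static void handle_iotlb_miss(int vhost_fd, int iotlb_eventfd)
{
	uint64_t cnt;
	struct vhost_iotlb_entry reply;

	/* vhost signalled the eventfd set via VHOST_SET_VRING_IOTLB_CALL */
	if (read(iotlb_eventfd, &cnt, sizeof(cnt)) != sizeof(cnt))
		return;

	reply = iotlb_request;
	if (iommu_translate(reply.iova, &reply.userspace_addr,
			    &reply.size, &reply.flags.perm))
		return;		/* no mapping: real code must report this */

	/* Push the translation back into the vhost device IOTLB */
	reply.flags.type = VHOST_IOTLB_UPDATE;
	ioctl(vhost_fd, VHOST_UPDATE_IOTLB, &reply);
}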
>
>>> Do we want at least a mod parameter for this?
>> Maybe.
>>
>>>> ---
>>>>  drivers/vhost/net.c        |   6 +-
>>>>  drivers/vhost/vhost.c      | 301 +++++++++++++++++++++++++++++++++++++++------
>>>>  drivers/vhost/vhost.h      |  17 ++-
>>>>  fs/eventfd.c               |   3 +-
>>>>  include/uapi/linux/vhost.h |  35 ++++++
>>>>  5 files changed, 320 insertions(+), 42 deletions(-)
>>>>
>> [...]
>>
>>>> +struct vhost_iotlb_entry {
>>>> +	__u64 iova;
>>>> +	__u64 size;
>>>> +	__u64 userspace_addr;
>>> Alignment requirements?
>> The API does not require any alignment. Will add a comment for this.
>>
>>>> +	struct {
>>>> +#define VHOST_ACCESS_RO      0x1
>>>> +#define VHOST_ACCESS_WO      0x2
>>>> +#define VHOST_ACCESS_RW      0x3
>>>> +		__u8  perm;
>>>> +#define VHOST_IOTLB_MISS           1
>>>> +#define VHOST_IOTLB_UPDATE         2
>>>> +#define VHOST_IOTLB_INVALIDATE     3
>>>> +		__u8  type;
>>>> +#define VHOST_IOTLB_INVALID        0x1
>>>> +#define VHOST_IOTLB_VALID          0x2
>>>> +		__u8  valid;
>>> why do we need this flag?
>> Useless, will remove.
>>
>>>> +		__u8  u8_padding;
>>>> +		__u32 padding;
>>>> +	} flags;
>>>> +};
>>>> +
>>>> +struct vhost_vring_iotlb_entry {
>>>> +	unsigned int index;
>>>> +	__u64 userspace_addr;
>>>> +};
>>>> +
>>>>  struct vhost_memory_region {
>>>>  	__u64 guest_phys_addr;
>>>>  	__u64 memory_size; /* bytes */
>>>> @@ -127,6 +153,15 @@ struct vhost_memory {
>>>>  /* Set eventfd to signal an error */
>>>>  #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
>>>>
>>>> +/* IOTLB */
>>>> +/* Specify an eventfd file descriptor to signle on IOTLB miss */
>>> typo
>> Will fix it.
>>
>>>> +#define VHOST_SET_VRING_IOTLB_CALL _IOW(VHOST_VIRTIO, 0x23, struct \
>>>> +					vhost_vring_file)
>>>> +#define VHOST_SET_VRING_IOTLB_REQUEST _IOW(VHOST_VIRTIO, 0x25, struct \
>>>> +					   vhost_vring_iotlb_entry)
>>>> +#define VHOST_UPDATE_IOTLB _IOW(VHOST_VIRTIO, 0x24, struct vhost_iotlb_entry)
>>>> +#define VHOST_RUN_IOTLB _IOW(VHOST_VIRTIO, 0x26, int)
>>>> +
>>> Is the assumption that userspace must dedicate a thread to running the iotlb?
>>> I dislike this one.
>>> Please support asynchronous APIs at least optionally to make
>>> userspace make its own threading decisions.
>> Nope, my qemu patches does not use a dedicated thread. This API is used
>> to start or top DMAR according to e.g whether guest enable DMAR for
>> intel IOMMU.
> I see. Seems rather confusing - do we need to start/stop it
> while device is running?

Technically, the guest driver (e.g. intel iommu) can stop DMAR at any
time.

>
>>>>  /* VHOST_NET specific defines */
>>>>
>>>>  /* Attach virtio net ring to a raw socket, or tap device.
>>> Don't we need a feature bit for this?
>> Yes we need it. The feature bit is not considered in this patch and
>> looks like it was still under discussion. After we finalize it, I will add.
>>
>>> Are we short on feature bits? If yes maybe it's time to add
>>> something like PROTOCOL_FEATURES that we have in vhost-user.
>>>
>> I believe it can just work like VERSION_1, or is there anything I missed?
> VERSION_1 is a virtio feature though. This one would be backend specific
> ...

Is there any difference? Considering we want the feature to be something
like VIRTIO_F_HOST_IOMMU, vhost could just add this to VHOST_FEATURES?
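E.g. something like the following is what I mean (the VIRTIO_F_HOST_IOMMU
name and bit number are placeholders until the virtio side is settled;
the existing feature bits are unchanged):

/* Placeholder name/bit until virtio finalizes the feature */
#define VIRTIO_F_HOST_IOMMU 33

/* drivers/vhost/net.c -- advertise it like any other feature bit
 * (or fold it into VHOST_FEATURES in vhost.h directly) */
enum {
	VHOST_NET_FEATURES = VHOST_FEATURES |
			     (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
			     (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
			     (1ULL << VIRTIO_F_HOST_IOMMU),
};

The idea would then be that qemu only enables the device IOTLB path when
the guest acks this bit.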