On 04/27/2016 07:45 PM, Michael S. Tsirkin wrote: > On Fri, Mar 25, 2016 at 10:34:34AM +0800, Jason Wang wrote: >> This patch tries to implement an device IOTLB for vhost. This could be >> used with for co-operation with userspace(qemu) implementation of DMA >> remapping. >> >> The idea is simple. When vhost meets an IOTLB miss, it will request >> the assistance of userspace to do the translation, this is done >> through: >> >> - Fill the translation request in a preset userspace address (This >> address is set through ioctl VHOST_SET_IOTLB_REQUEST_ENTRY). >> - Notify userspace through eventfd (This eventfd was set through ioctl >> VHOST_SET_IOTLB_FD). > Why use an eventfd for this? The aim is to implement the API all through ioctls. > We use them for interrupts because > that happens to be what kvm wants, but here - why don't we > just add a generic support for reading out events > on the vhost fd itself? I've considered this approach, but what's the advantages of this? I mean looks like all other ioctls could be done through vhost fd reading/writing too. > >> - device IOTLB were started and stopped through VHOST_RUN_IOTLB ioctl >> >> When userspace finishes the translation, it will update the vhost >> IOTLB through VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge of >> snooping the IOTLB invalidation of IOMMU IOTLB and use >> VHOST_UPDATE_IOTLB to invalidate the possible entry in vhost. > There's one problem here, and that is that VQs still do not undergo > translation. In theory VQ could be mapped in such a way > that it's not contigious in userspace memory. I'm not sure I get the issue, current vhost API support setting desc_user_addr, used_user_addr and avail_user_addr independently. So looks ok? If not, looks not a problem to device IOTLB API itself. > > >> Signed-off-by: Jason Wang <jasowang@xxxxxxxxxx> > What limits amount of entries that kernel keeps around? It depends on guest working set I think. Looking at http://dpdk.org/doc/guides/linux_gsg/sys_reqs.html: - For 2MB page size in guest, it suggests hugepages=1024 - For 1GB page size, it suggests a hugepages=4 So I choose 2048 to make sure it can cover this. > Do we want at least a mod parameter for this? Maybe. > >> --- >> drivers/vhost/net.c | 6 +- >> drivers/vhost/vhost.c | 301 +++++++++++++++++++++++++++++++++++++++------ >> drivers/vhost/vhost.h | 17 ++- >> fs/eventfd.c | 3 +- >> include/uapi/linux/vhost.h | 35 ++++++ >> 5 files changed, 320 insertions(+), 42 deletions(-) >> [...] >> +struct vhost_iotlb_entry { >> + __u64 iova; >> + __u64 size; >> + __u64 userspace_addr; > Alignment requirements? The API does not require any alignment. Will add a comment for this. > >> + struct { >> +#define VHOST_ACCESS_RO 0x1 >> +#define VHOST_ACCESS_WO 0x2 >> +#define VHOST_ACCESS_RW 0x3 >> + __u8 perm; >> +#define VHOST_IOTLB_MISS 1 >> +#define VHOST_IOTLB_UPDATE 2 >> +#define VHOST_IOTLB_INVALIDATE 3 >> + __u8 type; >> +#define VHOST_IOTLB_INVALID 0x1 >> +#define VHOST_IOTLB_VALID 0x2 >> + __u8 valid; > why do we need this flag? Useless, will remove. > >> + __u8 u8_padding; >> + __u32 padding; >> + } flags; >> +}; >> + >> +struct vhost_vring_iotlb_entry { >> + unsigned int index; >> + __u64 userspace_addr; >> +}; >> + >> struct vhost_memory_region { >> __u64 guest_phys_addr; >> __u64 memory_size; /* bytes */ >> @@ -127,6 +153,15 @@ struct vhost_memory { >> /* Set eventfd to signal an error */ >> #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file) >> >> +/* IOTLB */ >> +/* Specify an eventfd file descriptor to signle on IOTLB miss */ > typo Will fix it. > >> +#define VHOST_SET_VRING_IOTLB_CALL _IOW(VHOST_VIRTIO, 0x23, struct \ >> + vhost_vring_file) >> +#define VHOST_SET_VRING_IOTLB_REQUEST _IOW(VHOST_VIRTIO, 0x25, struct \ >> + vhost_vring_iotlb_entry) >> +#define VHOST_UPDATE_IOTLB _IOW(VHOST_VIRTIO, 0x24, struct vhost_iotlb_entry) >> +#define VHOST_RUN_IOTLB _IOW(VHOST_VIRTIO, 0x26, int) >> + > Is the assumption that userspace must dedicate a thread to running the iotlb? > I dislike this one. > Please support asynchronous APIs at least optionally to make > userspace make its own threading decisions. Nope, my qemu patches does not use a dedicated thread. This API is used to start or top DMAR according to e.g whether guest enable DMAR for intel IOMMU. > >> /* VHOST_NET specific defines */ >> >> /* Attach virtio net ring to a raw socket, or tap device. > Don't we need a feature bit for this? Yes we need it. The feature bit is not considered in this patch and looks like it was still under discussion. After we finalize it, I will add. > Are we short on feature bits? If yes maybe it's time to add > something like PROTOCOL_FEATURES that we have in vhost-user. > I believe it can just work like VERSION_1, or is there anything I missed? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html