On Wed, Feb 05, 2020 at 03:42:18PM +0800, Jason Wang wrote: > > On 2020/2/5 下午3:16, Michael S. Tsirkin wrote: > > On Wed, Feb 05, 2020 at 02:49:31PM +0800, Jason Wang wrote: > > > On 2020/2/5 下午2:30, Michael S. Tsirkin wrote: > > > > On Wed, Feb 05, 2020 at 01:50:28PM +0800, Jason Wang wrote: > > > > > On 2020/2/5 下午1:31, Michael S. Tsirkin wrote: > > > > > > On Wed, Feb 05, 2020 at 11:12:21AM +0800, Jason Wang wrote: > > > > > > > On 2020/2/5 上午10:05, Tiwei Bie wrote: > > > > > > > > On Tue, Feb 04, 2020 at 02:46:16PM +0800, Jason Wang wrote: > > > > > > > > > On 2020/2/4 下午2:01, Michael S. Tsirkin wrote: > > > > > > > > > > On Tue, Feb 04, 2020 at 11:30:11AM +0800, Jason Wang wrote: > > > > > > > > > > > 5) generate diffs of memory table and using IOMMU API to setup the dma > > > > > > > > > > > mapping in this method > > > > > > > > > > Frankly I think that's a bunch of work. Why not a MAP/UNMAP interface? > > > > > > > > > > > > > > > > > > > Sure, so that basically VHOST_IOTLB_UPDATE/INVALIDATE I think? > > > > > > > > Do you mean we let userspace to only use VHOST_IOTLB_UPDATE/INVALIDATE > > > > > > > > to do the DMA mapping in vhost-vdpa case? When vIOMMU isn't available, > > > > > > > > userspace will set msg->iova to GPA, otherwise userspace will set > > > > > > > > msg->iova to GIOVA, and vhost-vdpa module will get HPA from msg->uaddr? > > > > > > > > > > > > > > > > Thanks, > > > > > > > > Tiwei > > > > > > > I think so. Michael, do you think this makes sense? > > > > > > > > > > > > > > Thanks > > > > > > to make sure, could you post the suggested argument format for > > > > > > these ioctls? > > > > > > > > > > > It's the existed uapi: > > > > > > > > > > /* no alignment requirement */ > > > > > struct vhost_iotlb_msg { > > > > > __u64 iova; > > > > > __u64 size; > > > > > __u64 uaddr; > > > > > #define VHOST_ACCESS_RO 0x1 > > > > > #define VHOST_ACCESS_WO 0x2 > > > > > #define VHOST_ACCESS_RW 0x3 > > > > > __u8 perm; > > > > > #define VHOST_IOTLB_MISS 1 > > > > > #define VHOST_IOTLB_UPDATE 2 > > > > > #define VHOST_IOTLB_INVALIDATE 3 > > > > > #define VHOST_IOTLB_ACCESS_FAIL 4 > > > > > __u8 type; > > > > > }; > > > > > > > > > > #define VHOST_IOTLB_MSG 0x1 > > > > > #define VHOST_IOTLB_MSG_V2 0x2 > > > > > > > > > > struct vhost_msg { > > > > > int type; > > > > > union { > > > > > struct vhost_iotlb_msg iotlb; > > > > > __u8 padding[64]; > > > > > }; > > > > > }; > > > > > > > > > > struct vhost_msg_v2 { > > > > > __u32 type; > > > > > __u32 reserved; > > > > > union { > > > > > struct vhost_iotlb_msg iotlb; > > > > > __u8 padding[64]; > > > > > }; > > > > > }; > > > > Oh ok. So with a real device, I suspect we do not want to wait for each > > > > change to be processed by device completely, so we might want an asynchronous variant > > > > and then some kind of flush that tells device "you better apply these now". > > > > > > Let me explain: > > > > > > There are two types of devices: > > > > > > 1) device without on-chip IOMMU, DMA was done via IOMMU API which only > > > support incremental map/unmap > > Most IOMMUs have queues nowdays though. Whether APIs within kernel > > expose that matters but we are better off on emulating > > hardware not specific guest behaviour. > > > Last time I checked Intel IOMMU driver, I see the async QI is not used > there. And I'm not sure how queue will help much here. Qemu still need to > wait for all the DMA is setup to let guest work. > > > > > > 2) device with on-chip IOMMU, DMA could be done by device driver itself, and > > > we could choose to pass the whole mappings to the driver at one time through > > > vDPA bus operation (set_map) > > > > > > For vhost-vpda, there're two types of memory mapping: > > > > > > a) memory table, setup by userspace through VHOST_SET_MEM_TABLE, the whole > > > mapping is updated in this way > > > b) IOTLB API, incrementally done by userspace through vhost message > > > (IOTLB_UPDATE/IOTLB_INVALIDATE) > > > > > > The current design is: > > > > > > - Reuse VHOST_SET_MEM_TABLE, and for type 1), we can choose to send diffs > > > through IOMMU API or flush all the mappings then map new ones. For type 2), > > > just send the whole mapping through set_map() > > I know that at least for RDMA based things, you can't change > > a mapping if it's active. So drivers will need to figure out the > > differences which just looks ugly: userspace knows what > > it was changing (really just adding/removing some guest memory). > > > Two methods: > > 1) using IOTLB message VHOST_IOTLB_UPDATE/INVALIDATE > 2) let vhost differs from two memory tables which should not be too hard > (compare two rb trees) Right but 2 is just such an obvious waste of cyclces. userspace knows what changed why does vhost need to re-calculate it? No? > > > > > > > > > > - Reuse vhost IOTLB, so for type 1), simply forward update/invalidate > > > request via IOMMU API, for type 2), send IOTLB to vDPA device driver via > > > set_map(), device driver may choose to send diffs or rebuild all mapping at > > > their will > > > > > > Technically we can use vhost IOTLB API (map/umap) for building > > > VHOST_SET_MEM_TABLE, but to avoid device to process the each request, it > > > looks to me we need new UAPI which seems sub optimal. > > > > > > What's you thought? > > > > > > Thanks > > I suspect we can't completely avoid a new UAPI. > > > AFAIK, memory table usually contain just few entries, the performance cost > should be fine. (At least should be the same as the case of VFIO). > > So in qemu, simply hooking add_region/remove_region to > VHOST_IOTLB_UPDATE/VHOST_IOTLB_INVALIDATE should work? > > If we introduce API like you proposed previously (memory listener style): > > begin > add > remove > commit > > I suspect it will be too heavweight for the case of vIOMMU and for the > driver that want to build new mapping, we need addnop etc... > > Thanks > I feel this can help some workloads but this can wait, for sure. > >