On Wed, 2016-01-27 at 01:14 -0800, Neo Jia wrote: > On Tue, Jan 26, 2016 at 04:30:38PM -0700, Alex Williamson wrote: > > On Tue, 2016-01-26 at 14:28 -0800, Neo Jia wrote: > > > On Tue, Jan 26, 2016 at 01:06:13PM -0700, Alex Williamson wrote: > > > > > 1.1 Under per-physical device sysfs: > > > > > ---------------------------------------------------------------------------------- > > > > > > > > > > vgpu_supported_types - RO, list the current supported virtual GPU types and its > > > > > VGPU_ID. VGPU_ID - a vGPU type identifier returned from reads of > > > > > "vgpu_supported_types". > > > > > > > > > > vgpu_create - WO, input syntax <VM_UUID:idx:VGPU_ID>, create a virtual > > > > > gpu device on a target physical GPU. idx: virtual device index inside a VM > > > > > > > > > > vgpu_destroy - WO, input syntax <VM_UUID:idx>, destroy a virtual gpu device on a > > > > > target physical GPU > > > > > > > > > > > > I've noted in previous discussions that we need to separate user policy > > > > from kernel policy here, the kernel policy should not require a "VM > > > > UUID". A UUID simply represents a set of one or more devices and an > > > > index picks the device within the set. Whether that UUID matches a VM > > > > or is independently used is up to the user policy when creating the > > > > device. > > > > > > > > Personally I'd also prefer to get rid of the concept of indexes within a > > > > UUID set of devices and instead have each device be independent. This > > > > seems to be an imposition on the nvidia implementation into the kernel > > > > interface design. > > > > > > > > > > Hi Alex, > > > > > > I agree with you that we should not put UUID concept into a kernel API. At > > > this point (without any prototyping), I am thinking of using a list of virtual > > > devices instead of UUID. > > > > Hi Neo, > > > > A UUID is a perfectly fine name, so long as we let it be just a UUID and > > not the UUID matching some specific use case. > > > > > > > > > > > > int vgpu_map_virtual_bar > > > > > ( > > > > > uint64_t virt_bar_addr, > > > > > uint64_t phys_bar_addr, > > > > > uint32_t len, > > > > > uint32_t flags > > > > > ) > > > > > > > > > > EXPORT_SYMBOL(vgpu_map_virtual_bar); > > > > > > > > > > > > Per the implementation provided, this needs to be implemented in the > > > > vfio device driver, not in the iommu interface. Finding the DMA mapping > > > > of the device and replacing it is wrong. It should be remapped at the > > > > vfio device file interface using vm_ops. > > > > > > > > > > So you are basically suggesting that we are going to take a mmap fault and > > > within that fault handler, we will go into vendor driver to look up the > > > "pre-registered" mapping and remap there. > > > > > > Is my understanding correct? > > > > Essentially, hopefully the vendor driver will have already registered > > the backing for the mmap prior to the fault, but either way could work. > > I think the key though is that you want to remap it onto the vma > > accessing the vfio device file, not scanning it out of an IOVA mapping > > that might be dynamic and doing a vma lookup based on the point in time > > mapping of the BAR. The latter doesn't give me much confidence that > > mappings couldn't change while the former should be a one time fault. > > Hi Alex, > > The fact is that the vendor driver can only prevent such mmap fault by looking > up the <iova, hva> mapping table that we have saved from IOMMU memory listerner Why do we need to prevent the fault? We need to handle the fault when it occurs. > when the guest region gets programmed. Also, like you have mentioned below, such > mapping between iova and hva shouldn't be changed as long as the SBIOS and > guest OS are done with their job. But you don't know they're done with their job. > Yes, you are right it is one time fault, but the gpu work is heavily pipelined. Why does that matter? We're talking about the first time the VM accesses the range of the BAR that will be direct mapped to the physical GPU. This isn't going to happen in the middle of a benchmark, it's going to happen during driver initialization in the guest. > Probably we should just limit this interface to guest MMIO region and we can have > some crosscheck between the VFIO driver who has monitored the config spcae > access to make sure nothing getting moved around? No, the solution for the bar is very clear, map on fault to the vma accessing the mmap and be done with it for the remainder of this instance of the VM. > > In case it's not clear to folks at Intel, the purpose of this is that a > > vGPU may directly map a segment of the physical GPU MMIO space, but we > > may not know what segment that is at setup time, when QEMU does an mmap > > of the vfio device file descriptor. The thought is that we can create > > an invalid mapping when QEMU calls mmap(), knowing that it won't be > > accessed until later, then we can fault in the real mmap on demand. Do > > you need anything similar? > > > > > > > > > > > int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count) > > > > > > > > > > EXPORT_SYMBOL(vgpu_dma_do_translate); > > > > > > > > > > Still a lot to be added and modified, such as supporting multiple VMs and > > > > > multiple virtual devices, tracking the mapped / pinned region within VGPU IOMMU > > > > > kernel driver, error handling, roll-back and locked memory size per user, etc. > > > > > > > > Particularly, handling of mapping changes is completely missing. This > > > > cannot be a point in time translation, the user is free to remap > > > > addresses whenever they wish and device translations need to be updated > > > > accordingly. > > > > > > > > > > When you say "user", do you mean the QEMU? > > > > vfio is a generic userspace driver interface, QEMU is a very, very > > important user of the interface, but not the only user. So for this > > conversation, we're mostly talking about QEMU as the user, but we should > > be careful about assuming QEMU is the only user. > > > > Understood. I have to say that our focus at this moment is to support QEMU and > KVM, but I know VFIO interface is much more than that, and that is why I think > it is right to leverage this framework so we can together explore future use > case in the userland. > > > > > Here, whenever the DMA that > > > the guest driver is going to launch will be first pinned within VM, and then > > > registered to QEMU, therefore the IOMMU memory listener, eventually the pages > > > will be pinned by the GPU or DMA engine. > > > > > > Since we are keeping the upper level code same, thinking about passthru case, > > > where the GPU has already put the real IOVA into his PTEs, I don't know how QEMU > > > can change that mapping without causing an IOMMU fault on a active DMA device. > > > > For the virtual BAR mapping above, it's easy to imagine that mapping a > > BAR to a given address is at the guest discretion, it may be mapped and > > unmapped, it may be mapped to different addresses at different points in > > time, the guest BIOS may choose to map it at yet another address, etc. > > So if somehow we were trying to setup a mapping for peer-to-peer, there > > are lots of ways that IOVA could change. But even with RAM, we can > > support memory hotplug in a VM. What was once a DMA target may be > > removed or may now be backed by something else. Chipset configuration > > on the emulated platform may change how guest physical memory appears > > and that might change between VM boots. > > > > Currently with physical device assignment the memory listener watches > > for both maps and unmaps and updates the iotlb to match. Just like real > > hardware doing these same sorts of things, we rely on the guest to stop > > using memory that's going to be moved as a DMA target prior to moving > > it. > > Right, you can only do that when the device is quiescent. > > As long as this will be notified to the guest, I think we should be able to > support it although the real implementation will depend on how the device gets into > quiescent state. > > This is definitely a very interesting feature we should explore, but I hope we > probably can first focus on the most basic functionality. If we only do a point-in-time translation and assume it never changes, that's good enough for a proof of concept, but it's not a complete solution. I think this is practical problem, not just an academic problem. There needs to be a mechanism mappings to be invalidated based on VM memory changes. Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html