> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Saturday, May 29, 2021 3:59 AM
>
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> >
> > 5. Use Cases and Flows
> >
> > Here assume VFIO will support a new model where every bound device
> > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > going through the legacy container/group interface. For illustration purposes
> > those devices are just called dev[1...N]:
> >
> >         device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> >
> > As explained earlier, one IOASID fd is sufficient for all intended use cases:
> >
> >         ioasid_fd = open("/dev/ioasid", mode);
> >
> > For simplicity the examples below are all made for the virtualization story.
> > They are representative and could be easily adapted to a non-virtualization
> > scenario.
>
> For others, I don't think this is *strictly* necessary, we can
> probably still get to the device_fd using the group_fd and fit in
> /dev/ioasid. It does make the rest of this more readable though.

Jason, I want to confirm here. Per earlier discussion we were under the
impression that you want VFIO to be a pure device driver, with
container/group used only by legacy applications. From this comment, are
you suggesting that VFIO can still keep the container/group concepts, and
userspace simply deprecates the vfio iommu uAPI (e.g. VFIO_SET_IOMMU) by
using /dev/ioasid instead (which has a simple policy that an IOASID will
reject commands if a partially-attached group exists)?

> >
> > Three types of IOASIDs are considered:
> >
> >         gpa_ioasid[1...N]:      for GPA address space
> >         giova_ioasid[1...N]:    for guest IOVA address space
> >         gva_ioasid[1...N]:      for guest CPU VA address space
> >
> > At least one gpa_ioasid must always be created per guest, while the other
> > two are relevant only as far as vIOMMU is concerned.
> >
> > Examples here apply to both pdev and mdev, if not explicitly marked out
> > (e.g. in section 5.5). The VFIO device driver in the kernel will figure out
> > the associated routing information in the attaching operation.
> >
> > For illustration simplicity, IOASID_CHECK_EXTENSION and
> > IOASID_GET_INFO are skipped in these examples.
> >
> > 5.1. A simple example
> > ++++++++++++++++++
> >
> > Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> > space is managed through the DMA mapping protocol:
> >
> >         /* Bind device to IOASID fd */
> >         device_fd = open("/dev/vfio/devices/dev1", mode);
> >         ioasid_fd = open("/dev/ioasid", mode);
> >         ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
> >
> >         /* Attach device to IOASID */
> >         gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> >         at_data = { .ioasid = gpa_ioasid };
> >         ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
> >
> >         /* Setup GPA mapping */
> >         dma_map = {
> >                 .ioasid = gpa_ioasid;
> >                 .iova   = 0;            // GPA
> >                 .vaddr  = 0x40000000;   // HVA
> >                 .size   = 1GB;
> >         };
> >         ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> > If more devices than just dev1 are assigned to the guest, the user follows
> > the above sequence to attach the other devices to the same gpa_ioasid,
> > i.e. sharing the GPA address space across all assigned devices.
>
> eg
>
>  device2_fd = open("/dev/vfio/devices/dev1", mode);
>  ioctl(device2_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
>  ioctl(device2_fd, VFIO_ATTACH_IOASID, &at_data);
>
> Right?

Exactly, except a small typo ('dev1' -> 'dev2'). :)
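i.e. with the path fixed it would read (nothing new here, just the corrected
version of the same sequence, with at_data still carrying gpa_ioasid):

        device2_fd = open("/dev/vfio/devices/dev2", mode);
        ioctl(device2_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
        ioctl(device2_fd, VFIO_ATTACH_IOASID, &at_data);

so dev2 ends up sharing the GPA address space with dev1.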
> > 5.2. Multiple IOASIDs (no nesting)
> > ++++++++++++++++++++++++++++
> >
> > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > both devices are attached to gpa_ioasid. After boot the guest creates
> > a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in
> > pass-through mode (gpa_ioasid).
> >
> > Suppose IOASID nesting is not supported in this case. Qemu needs to
> > generate shadow mappings in userspace for giova_ioasid (like how
> > VFIO works today).
> >
> > To avoid duplicated locked page accounting, it's recommended to pre-
> > register the virtual address range that will be used for DMA:
> >
> >         device_fd1 = open("/dev/vfio/devices/dev1", mode);
> >         device_fd2 = open("/dev/vfio/devices/dev2", mode);
> >         ioasid_fd = open("/dev/ioasid", mode);
> >         ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> >         ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
> >
> >         /* pre-register the virtual address range for accounting */
> >         mem_info = { .vaddr = 0x40000000; .size = 1GB };
> >         ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
> >
> >         /* Attach dev1 and dev2 to gpa_ioasid */
> >         gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> >         at_data = { .ioasid = gpa_ioasid };
> >         ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >         ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> >         /* Setup GPA mapping */
> >         dma_map = {
> >                 .ioasid = gpa_ioasid;
> >                 .iova   = 0;            // GPA
> >                 .vaddr  = 0x40000000;   // HVA
> >                 .size   = 1GB;
> >         };
> >         ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> >         /* After boot, guest enables a GIOVA space for dev2 */
> >         giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> >
> >         /* First detach dev2 from the previous address space */
> >         at_data = { .ioasid = gpa_ioasid };
> >         ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
> >
> >         /* Then attach dev2 to the new address space */
> >         at_data = { .ioasid = giova_ioasid };
> >         ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> >         /* Setup a shadow DMA mapping according to vIOMMU
> >          * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> >          */
>
> Here "shadow DMA" means relay the guest's vIOMMU page tables to the HW
> IOMMU?

'shadow' here means the merged mapping: GIOVA (0x2000) -> HVA (0x40001000).

> >         dma_map = {
> >                 .ioasid = giova_ioasid;
> >                 .iova   = 0x2000;       // GIOVA
> >                 .vaddr  = 0x40001000;   // HVA
>
> eg HVA came from reading the guest's page tables and finding it wanted
> GPA 0x1000 mapped to IOVA 0x2000?

Yes.
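To spell out the composition (just an illustration of the userspace-side
bookkeeping; viommu_translate() and gpa_to_hva() are made-up Qemu-side
helpers, not part of the proposed uAPI):

        gpa = viommu_translate(giova);  /* walk vIOMMU table: 0x2000 -> 0x1000 */
        hva = gpa_to_hva(gpa);          /* Qemu memory map: 0x1000 -> 0x40001000 */

        dma_map = {
                .ioasid = giova_ioasid;
                .iova   = giova;        // GIOVA
                .vaddr  = hva;          // HVA
                .size   = 4KB;
        };
        ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);

The GPA only shows up transiently in userspace; without nesting the kernel
only ever sees the merged GIOVA -> HVA mapping.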
> > 5.3. IOASID nesting (software)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with software-based IOASID nesting
> > available. In this mode it is the kernel instead of the user that creates
> > the shadow mapping.
> >
> > The flow before the guest boots is the same as 5.2, except for one point.
> > Because giova_ioasid is nested on gpa_ioasid, locked page accounting is
> > only conducted for gpa_ioasid. So it's not necessary to pre-register
> > virtual memory.
> >
> > To save space we only list the steps after boot (i.e. both dev1/dev2
> > have been attached to gpa_ioasid before the guest boots):
> >
> >         /* After boot */
> >         /* Make GIOVA space nested on GPA space */
> >         giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> >                                 gpa_ioasid);
> >
> >         /* Attach dev2 to the new address space (child)
> >          * Note dev2 is still attached to gpa_ioasid (parent)
> >          */
> >         at_data = { .ioasid = giova_ioasid };
> >         ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> >         /* Setup a GIOVA->GPA mapping for giova_ioasid, which will be
> >          * merged by the kernel with the GPA->HVA mapping of gpa_ioasid
> >          * to form a shadow mapping.
> >          */
> >         dma_map = {
> >                 .ioasid = giova_ioasid;
> >                 .iova   = 0x2000;       // GIOVA
> >                 .vaddr  = 0x1000;       // GPA
> >                 .size   = 4KB;
> >         };
> >         ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> And in this version the kernel reaches into the parent IOASID's page
> tables to translate 0x1000 to 0x40001000 to physical page? So we
> basically remove the qemu process address space entirely from this
> translation. It does seem convenient yes.
>
> > 5.4. IOASID nesting (hardware)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with hardware-based IOASID nesting
> > available. In this mode the pgtable binding protocol is used to
> > bind the guest IOVA page table with the IOMMU:
> >
> >         /* After boot */
> >         /* Make GIOVA space nested on GPA space */
> >         giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> >                                 gpa_ioasid);
> >
> >         /* Attach dev2 to the new address space (child)
> >          * Note dev2 is still attached to gpa_ioasid (parent)
> >          */
> >         at_data = { .ioasid = giova_ioasid };
> >         ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> >         /* Bind guest I/O page table */
> >         bind_data = {
> >                 .ioasid = giova_ioasid;
> >                 .addr   = giova_pgtable;
> >                 // and format information
> >         };
> >         ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> I really think you need to use consistent language. Things that
> allocate a new IOASID should be called IOASID_ALLOC_IOASID. If multiple
> IOCTLs are needed then it is IOASID_ALLOC_IOASID_PGTABLE, etc.
> alloc/create/bind is too confusing.
>
> > 5.5. Guest SVA (vSVA)
> > ++++++++++++++++++
> >
> > After boot the guest further creates a GVA address space (gpasid1) on
> > dev1. Dev2 is not affected (still attached to giova_ioasid).
> >
> > As explained in section 4, the user should avoid exposing ENQCMD on both
> > pdev and mdev.
> >
> > The sequence applies to all device types (pdev or mdev), except for
> > one additional step to call KVM for an ENQCMD-capable mdev:
> >
> >         /* After boot */
> >         /* Make GVA space nested on GPA space */
> >         gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> >                                 gpa_ioasid);
> >
> >         /* Attach dev1 to the new address space and specify vPASID */
> >         at_data = {
> >                 .ioasid     = gva_ioasid;
> >                 .flag       = IOASID_ATTACH_USER_PASID;
> >                 .user_pasid = gpasid1;
> >         };
> >         ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> Still a little unsure why the vPASID is here not on the gva_ioasid. Is
> there any scenario where we want different vpasid's for the same
> IOASID? I guess it is OK like this. Hum.

Yes, it's completely sane for the guest to link an I/O page table to
different vPASIDs on dev1 and dev2. The IOMMU doesn't mandate that
multiple devices sharing an I/O page table must use the same PASID#.
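The example above only attaches dev1, but hypothetically, if the guest
assigned the same page table to dev2 under a different vPASID, the flow
would simply be (a sketch only, reusing the names above; gpasid2 is a
second guest-allocated PASID, not part of the original example):

        /* dev1: vPASID gpasid1 for this I/O page table */
        at_data = {
                .ioasid     = gva_ioasid;
                .flag       = IOASID_ATTACH_USER_PASID;
                .user_pasid = gpasid1;
        };
        ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

        /* dev2: a different vPASID (gpasid2) for the same IOASID */
        at_data.user_pasid = gpasid2;
        ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

So the vPASID is really a per-(device, IOASID) attribute, which is why it
sits on the attach call rather than on the gva_ioasid itself.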
> >         /* if dev1 is an ENQCMD-capable mdev, update the CPU PASID
> >          * translation structure through KVM
> >          */
> >         pa_data = {
> >                 .ioasid_fd   = ioasid_fd;
> >                 .ioasid      = gva_ioasid;
> >                 .guest_pasid = gpasid1;
> >         };
> >         ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
>
> Make sense
>
> >         /* Bind guest I/O page table */
> >         bind_data = {
> >                 .ioasid = gva_ioasid;
> >                 .addr   = gva_pgtable1;
> >                 // and format information
> >         };
> >         ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> Again I do wonder if this should just be part of alloc_ioasid. Is
> there any reason to split these things? The only advantage to the
> split is the device is known, but the device shouldn't impact
> anything..

I summarized this as open#4 in another mail for focused discussion.

> > 5.6. I/O page fault
> > +++++++++++++++
> >
> > (uAPI is TBD. Here is just the high-level flow from the host IOMMU driver
> > to the guest IOMMU driver and back.)
> >
> > - Host IOMMU driver receives a page request with raw fault_data {rid,
> > pasid, addr};
> >
> > - Host IOMMU driver identifies the faulting I/O page table according to
> > the information registered by the IOASID fault handler;
> >
> > - IOASID fault handler is called with the raw fault_data (rid, pasid, addr),
> > which is saved in ioasid_data->fault_data (used for the response);
> >
> > - IOASID fault handler generates a user fault_data (ioasid, addr), links it
> > to the shared ring buffer and triggers the eventfd to userspace;
>
> Here rid should be translated to a labeled device and return the
> device label from VFIO_BIND_IOASID_FD. Depending on how the device
> bound the label might match to a rid or to a rid,pasid

Yes, I acknowledged this input from you and Jean about page fault and
bind_pasid_table, and summarized it as open#3 in another mail.
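Just to illustrate the direction (the uAPI is TBD as noted above, and the
field names below are made up for discussion, not a proposal): with the
device-label idea, the user-visible fault record pulled off the shared
ring could look roughly like:

        struct ioasid_fault_data {
                __u32   ioasid;     /* faulting I/O address space */
                __u32   dev_label;  /* label given at VFIO_BIND_IOASID_FD time */
                __u32   pasid;      /* meaningful if the label covers a whole RID */
                __u64   addr;       /* faulting address */
        };

and the completion ioctl would only need to carry (ioasid, response_code)
back, with the kernel recovering {rid, pasid} from the saved raw
fault_data.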
The following is thus skipped...

Thanks
Kevin

> > - Upon receiving the event, Qemu needs to find the virtual routing
> > information (v_rid + v_pasid) of the device attached to the faulting
> > ioasid. If there are multiple, pick a random one. This should be fine
> > since the purpose is to fix the I/O page table on the guest;
>
> The device label should fix this
>
> > - Qemu finds the pending fault event, converts the virtual completion data
> > into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> > complete the pending fault;
> >
> > - /dev/ioasid finds the pending fault data {rid, pasid, addr} saved in
> > ioasid_data->fault_data, and then calls the iommu api to complete it with
> > {rid, pasid, response_code};
>
> So resuming a fault on an ioasid will resume all devices pending on
> the fault?
>
> > 5.7. BIND_PASID_TABLE
> > ++++++++++++++++++++
> >
> > The PASID table is put in the GPA space on some platforms, thus it must be
> > updated by the guest. It is treated as another user page table to be bound
> > with the IOMMU.
> >
> > As explained earlier, the user still needs to explicitly bind every user I/O
> > page table to the kernel so the same pgtable binding protocol (bind, cache
> > invalidate and fault handling) is unified across platforms.
> >
> > vIOMMUs may include a caching mode (or paravirtualized way) which, once
> > enabled, requires the guest to invalidate the PASID cache for any change
> > on the PASID table. This allows Qemu to track the lifespan of guest I/O
> > page tables.
> >
> > In case such a capability is missing, Qemu could enable write-protection on
> > the guest PASID table to achieve the same effect.
> >
> >         /* After boot */
> >         /* Make vPASID space nested on GPA space */
> >         pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> >                                 gpa_ioasid);
> >
> >         /* Attach dev1 to pasidtbl_ioasid */
> >         at_data = { .ioasid = pasidtbl_ioasid };
> >         ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> >         /* Bind PASID table */
> >         bind_data = {
> >                 .ioasid = pasidtbl_ioasid;
> >                 .addr   = gpa_pasid_table;
> >                 // and format information
> >         };
> >         ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
> >
> >         /* vIOMMU detects a new GVA I/O space created */
> >         gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> >                                 gpa_ioasid);
> >
> >         /* Attach dev1 to the new address space, with gpasid1 */
> >         at_data = {
> >                 .ioasid     = gva_ioasid;
> >                 .flag       = IOASID_ATTACH_USER_PASID;
> >                 .user_pasid = gpasid1;
> >         };
> >         ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> >         /* Bind guest I/O page table. Because BIND_PASID_TABLE has been
> >          * used, the kernel will not update the PASID table. Instead, it
> >          * just tracks the bound I/O page table for handling invalidation
> >          * and I/O page faults.
> >          */
> >         bind_data = {
> >                 .ioasid = gva_ioasid;
> >                 .addr   = gva_pgtable1;
> >                 // and format information
> >         };
> >         ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> I still don't quite get the benefit from doing this.
>
> The idea to create an all-PASID IOASID seems to work better with less
> fuss on HW that is directly parsing the guest's PASID table.
>
> Cache invalidate seems easy enough to support
>
> Fault handling needs to return the (ioasid, device_label, pasid) when
> working with this kind of ioasid.
>
> It is true that it does create an additional flow qemu has to
> implement, but it does directly mirror the HW.
>
> Jason