> -----Original Message----- > From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@xxxxxxx] > Sent: Monday, March 27, 2017 6:14 PM > To: Liu, Yi L <yi.l.liu@xxxxxxxxx>; Alex Williamson <alex.williamson@xxxxxxxxxx> > Cc: Shanker Donthineni <shankerd@xxxxxxxxxxxxxxxx>; kvm@xxxxxxxxxxxxxxx; > Catalin Marinas <catalin.marinas@xxxxxxx>; Sinan Kaya > <okaya@xxxxxxxxxxxxxxxx>; Will Deacon <will.deacon@xxxxxxx>; > iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx; Harv Abdulhamid <harba@xxxxxxxxxxxxxxxx>; > linux-pci@xxxxxxxxxxxxxxx; Bjorn Helgaas <bhelgaas@xxxxxxxxxx>; David > Woodhouse <dwmw2@xxxxxxxxxxxxx>; linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; Nate > Watterson <nwatters@xxxxxxxxxxxxxxxx>; Tian, Kevin <kevin.tian@xxxxxxxxx>; > Lan, Tianyu <tianyu.lan@xxxxxxxxx>; Raj, Ashok <ashok.raj@xxxxxxxxx>; Pan, Jacob > jun <jacob.jun.pan@xxxxxxxxx>; Joerg Roedel <joro@xxxxxxxxxx>; Robin Murphy > <robin.murphy@xxxxxxx> > Subject: Re: [RFC PATCH 29/30] vfio: Add support for Shared Virtual Memory > > On 24/03/17 07:46, Liu, Yi L wrote: > [...] > >>>> > >>>> So we need some kind of high-level classification that the vIOMMU > >>>> must communicate to the physical one. Each IOMMU flavor would get a > >>>> unique, global identifier, simply to make sure that vIOMMU and > >>>> pIOMMU speak > >> the same language. > >>>> For example: > >>>> > >>>> 0x65776886 "AMDV" AMD IOMMU > >>>> 0x73788476 "INTL" Intel IOMMU > >>>> 0x83515748 "S390" s390 IOMMU > >>>> 0x83777785 "SMMU" ARM SMMU > >>>> etc. > >>>> > >>>> It needs to be a global magic number that everyone can recognize. > >>>> Could be as simple as 32-bit numbers allocated from 0. Once we have > >>>> a global magic number, we can use it to differentiate architecture-specific > details. > > > > I prefer simple numbers to stand for each vendor. > > Sure, I don't have any preference. Simple numbers could be easier to allocate. > > >>> I may need to think more on this part. > >>> > >>>> struct pasid_table_info { > >>>> __u64 ptr; > >>>> __u64 size; /* Is it number of entry or size in > >>>> bytes? */ > >>> > >>> For Intel platform, it's encoded. But I can make it in bytes. Here, > >>> I'd like to check with you if whole guest PASID info is also needed on ARM? > >> > >> It will be needed on ARM if someone ever emulates the SMMU with SVM. > >> Though I'm not planning on doing that myself, it is unavoidable. And > >> it would be a shame for the next SVM virtualization solution to have > >> to introduce a new flag "VFIO_SVM_BIND_PASIDPT_2" if they could reuse > >> most of the BIND_PASIDPT interface but simply needed to add one or > >> two configuration fields specific to their IOMMU. > > > > So you are totally fine with putting PASID table ptr and size in the > > generic part? Maybe we have different usage for it. For me, it's a > > guest PASID table ptr. For you, it may be different. > > It's the same for SMMU, with some added format specifiers that would go in > 'opaque[]'. I think that table pointer and size (in bytes, or number of > entries) is generic enough for a "bind table" call and can be reused by future > implementations. > > >>>> > >>>> __u32 model; /* magic number */ > >>>> __u32 variant; /* version of the IOMMU architecture, > >>>> maybe? IOMMU-specific. */ > > > > For variant, it will be combined with model to do sanity check. Am I right? > > Maybe it could be moved to opaque. > > Yes I guess it could be moved to opaque. It would be a version of the model used, so > we wouldn't have to allocate a new model number whenever an architecture > updates the fields of its PASID descriptors, but we can let IOMMU drivers decide if > they need it and what to put in there. > > >>>> __u8 opaque[]; /* IOMMU-specific details */ > >>>> }; > >>>> > [...] > >> > >> Yes, that seems sensible. I could add an explicit VFIO_BIND_PASID > >> flags to make it explicit that data[] is "u32 pasid" and avoid having any default. > > > > Add it in the comment I suppose. The length is 4 byes, it could be deduced from > argsz. > > > >> > >>>> > >>>>> #define VFIO_SVM_PASSDOWN_INVALIDATE (1 << 1) > >>>> > >>>> Using the vfio_device_svm structure for invalidate operations is a > >>>> bit odd, it might be nicer to add a new VFIO_SVM_INVALIDATE ioctl, > >>>> that takes the above iommu_svm_tlb_invalidate_info as argument > >>>> (with an added argsz.) > >>> > >>> Agree, would add a separate IOCTL for invalidation. > >>> > >>>> > >>>>> #define VFIO_SVM_PASID_RELEASE_FLUSHED (1 << 2) > >>>>> #define VFIO_SVM_PASID_RELEASE_CLEAN (1 << 3) > >>>>> __u32 flags; > >>>>> __u32 length; > >>>> > >>>> If length is the size of data[], I guess we can already deduce this info from argsz. > >>> > >>> yes, it is size of data. Maybe remove argsz. How about your opinion? > >> > >> Argsz as first argument is standard across all VFIO ioctls, it allows > >> the kernel to check that the structure passed by the guest is > >> consistent with the structure it knows (see the comment at the > >> beginning of > >> include/uapi/linux/vfio.h) So argsz would be the preferred way to > >> communicate the size of data[] > > > > yes, it makes sense. BTW. I think it is still possible to leave the > > "length" since there is similar usage. e.g. struct vfio_irq_set, count tells the size of > data[] in that structure. > > Anyhow, it's no big deal here. > > > >>>>> __u8 data[]; > >>>>> }; > >>>> > >>>> In general, I think that your proposal would work fine along mine. > >>>> Note that for my next version of this patch, I would like to move > >>>> the BIND/UNBIND SVM operations onto a VFIO container fd, instead of > >>>> a VFIO device > >> fd, if possible. > >>> > >>> Attach the BIND/UNBIND operation onto VFIO container fd is practical. > >>> > >>> BTW. Before you send out your next version, we'd better have a > >>> consensus on the vfio_device_svm definition. So that we can continue > >>> to drive our > >> own work separately. > >>> > >>>> --- > >>>> As an aside, it also aligns with the one I'm working on at the > >>>> moment, for virtual SVM at a finer granularity, where the BIND call > >>>> is for a page table. I would add this flag to vfio_device_svm: > >>>> > >>>> #define VFIO_SVM_BIND_PGTABLE (1 << x) > >>> > >>> Sure. I think it may also be a requirement from other vendors. I > >>> think you've mentioned it in the above statements. > >>> > >>>> Which would indicate the following data (just a draft, I'm > >>>> oscillating between this and a generic PASID table solution, which > >>>> would instead > >> reuse your proposal): > >>>> > >>>> struct pgtable_info { > >>>> __u32 pasid; > >>>> > >>>> __u64 pgd; > >>>> > >>>> __u32 model; > >>>> __u32 variant; > >>>> > >>>> __u8 opaque[]; > >>>> }; > > > > I think you would have this definition as an ARM specific one. Would > > be filled in the data[] of vfio_device_svm. Is it? > > Yes it would be another kind of data[] for vfio_device_svm. > > >>>> > >>>> On ARM SMMU we would need to specify an io-pgtable variant and the > >>>> opaque structure would be bits of io_pgtable_cfg (tcr, mair, etc.) > >>>> > >>>> The problem of portability is slightly different here, because > >>>> while PASID table format is specific to an IOMMU, page table format > >>>> might be the same across multiple flavors of IOMMUs. For instance, > >>>> the PASID table format I use in this series can only be found in > >>>> the ARM > >>>> SMMUv3 architecture, but the page table format is the same as ARM > >>>> SMMUv2, and other MMUs. I'd like to implement an IOMMU independent > >>>> of any page-table formats, that could use whatever the host offers > >>>> (not necessarily MMU PT). The model numbers described above > >>>> wouldn't be suitable, > >> so we'd need another set of magic numbers for page table formats. > >>> > >>> Not sure if I totally got your points. We may assume PASID table > >>> format differs from vendor to vendor. For the page table format, I > >>> assume you mean the page table of process. Or in other words MMU page table. > >>> > >>> I'm a little bit confused about the following statements. Could you > >>> speak a little bit > >> more? > >>> Is it for virtual SVM or user-space driver SVM usage? > >>> " I'd like to implement an IOMMU independent of any page-table > >>> formats, that could use whatever the host offers (not necessarily MMU PT)." > >> > >> The IOMMU might be using a different page table format than the MMU. > >> I'm currently toying with the idea of having a portable > >> paravirtualized IOMMU that can query the page table format used by > >> pIOMMU, and adopt it if the page table handling code is available in > >> the guest. This would be for non-SVM nested translation, using VFIO > >> to bind page tables private to the IOMMU. I'll send a separate RFC to discuss this. > > > > sounds interesting. Look forward to your further work. > > > >> > >>> I's nice to have such discussion. Let's co-work and have a well-defined API. > >> > >> I agree. I don't plan to send a new version immediately since it will > >> take some time rework, but we can sync-up here or in private to avoid conflicting > proposals. > > > > yes, let's keep in touch. For "struct vfio_device_svm", how about define it in such > way: > > > > struct vfio_device_svm { > > __u32 argsz; > > /* data length would be sizeof(pasid_table_info) */ > > #define VFIO_SVM_BIND_PASIDTP (1 << 0) > > /* data length would be 4 byte */ > > #define VFIO_BIND_PASID (1 << 1) > > /* data length would be sizeof(x) */ > > #define VFIO_SVM_PASID_RELEASE_FLUSHED (1 << 2) > > /* data length would be sizeof(y) */ > > #define VFIO_SVM_PASID_RELEASE_CLEAN (1 << 3) > > (Note that I'll probably drop these two flags in next version.) > > > /* data length would be sizeof(pgtable_info) */ > > #define VFIO_SVM_BIND_PGTABLE (1 << 4) > > __u32 flags; > > __u32 length; /*can remove it if you think it's better */ > > __u8 data[]; /* payload data for different bind/unbind request*/ > > }; > > > > struct pasid_table_info { > > __u16 sid; /* It's BDF of device. */ > > __u64 ptr; /* PASID table ptr */ > > __u64 size; /* PASID table size in bytes */ > > __u32 model; /* magic number */ > > __u32 variant; /* YiL: maybe move it to opaque? */ > > Yes, I think it could be in opaque, not all formats would need that. ptr, size, model > and opaque seem sufficient. > > > __u8 opaque[]; /* IOMMU-specific details */ > > }; > > > > I added a new field as "sid". The reason is as below: > > Since the IOCTL is going to be on container fd, then needs to tell > > where does the bind request come from. SID could be used to filter the target core > IOMMU. > > Also, it's needed to find the corresponding context entry with the SID > > on Intel platform. I assume you also need an equivalent on your work. > > Pls let me know if you prefer different definition. > > Actually moving the ioctl to the container changes the BIND model: the PASID table > would be attached to all devices in the container, so there shouldn't be any device > selector in the ioctl. I think it is more in line with the existing MAP/UNMAP API, > where a container represents an address space, and all devices in the container > access the same virtual addresses. > It adds complexity on the driver side, but seems more clear from the user point of > view. Hi Jean, I do need a device selector here. Bind whole guest PASID table in host thus it needs to modify a context entry which is per-device. In other words, it is to replace the PASID table in host instead of modifying some entries. Maybe I can move it to opaque. Then it won't bother other user. Since my op is actually per-device, so it may be an open if I need to place IOCTL in vfio-pci. Anyhow, iommu ops usage mainly happen in type1, I don't want to break the rule so far. So I would go on with the consensus we had here. Regards, Yi L