On Fri, Aug 30, 2024 at 04:09:04PM +0000, Mostafa Saleh wrote: > Hi Jason, > > On Tue, Aug 27, 2024 at 12:51:38PM -0300, Jason Gunthorpe wrote: > > For SMMUv3 a IOMMU_DOMAIN_NESTED is composed of a S2 iommu_domain acting > > as the parent and a user provided STE fragment that defines the CD table > > and related data with addresses translated by the S2 iommu_domain. > > > > The kernel only permits userspace to control certain allowed bits of the > > STE that are safe for user/guest control. > > > > IOTLB maintenance is a bit subtle here, the S1 implicitly includes the S2 > > translation, but there is no way of knowing which S1 entries refer to a > > range of S2. > > > > For the IOTLB we follow ARM's guidance and issue a CMDQ_OP_TLBI_NH_ALL to > > flush all ASIDs from the VMID after flushing the S2 on any change to the > > S2. > > > > Similarly we have to flush the entire ATC if the S2 is changed. > > > > I am still reviewing this patch, but just some quick questions. > > 1) How does userspace do IOTLB maintenance for S1 in that case? See https://lore.kernel.org/linux-iommu/cover.1724776335.git.nicolinc@xxxxxxxxxx Patch 17 Really, this series and that series must be together. We have a patch planning issue to sort out here as well, all 27 should go together into the same merge window. > 2) Is there a reason the UAPI is designed this way? > The way I imagined this, is that userspace will pass the pointer to the CD > (+ format) not the STE (or part of it). Yes, we need more information from the STE than just that. EATS and STALL for instance. And the cachability below. Who knows what else in the future. We also want to support the V=0, Bypass and Abort STE configurations under the nesting domain (V, CFG required) so that the VIOMMU can remain affiliated with the STE in all cases. This is necessary to allow VIOMMU event reporting to always work. Looking at the masks: STRTAB_STE_0_NESTING_ALLOWED = 0xf80fffffffffffff STRTAB_STE_1_NESTING_ALLOWED = 0x380000ff So we do use alot of the bits. Reformatting from the native HW format into something else doesn't seem better for VMM or kernel.. This is similar to the invalidation design where we also just forward the invalidation command as is in native HW format, and how IDR is done the same. Overall this sort of direct transparency is how I prefer to see these kinds of iommufd HW specific interfaces designed. From a lot of experience here, arbitary marshall/unmarshall is often an antipattern :) > Making user space messing with shareability and cacheability of S1 CD access > feels odd. (Although CD configure page table access which is similar). As I understand it, the walk of the CD table will be constrained by the S2FWB, just like all the other accesses by the guest. So we just take a consistent approach of allowing the guest to provide memattrs in the vSTE, CD, and S1 page table and rely on the HW's S2FWB to police it. As you say there are lots of memattr type bits under direct guest control, it doesn't necessarily make alot of sense to permit everything in those contexts and then add extra code to do something different here. Though I agree it looks odd, it is self-consistent. Jason