>> +	for (i = 0; i < to_nsmmu(smmu)->num_inst; i++)

> It might make more sense to make this the innermost loop, i.e.:
>
>	for (i = 0; i < nsmmu->num_inst; i++)
>		reg &= readl_relaxed(nsmmu_page(smmu, i, page)...
>
> since polling the instances in parallel rather than in series seems like it
> might be a bit more efficient.

Sync register is programmed at the same time for both instances. The status
check is serialized. I can update it to check status of both at the same time.

>> +	if (smmu->impl->tlb_sync) {
>> +		smmu->impl->tlb_sync(smmu, page, sync, status);

> What I'd hoped is that rather than needing a hook for this, you could just
> override smmu_domain->tlb_ops from .init_context to wire up the alternate
> .sync method directly. That would save this extra level of indirection.

With arm_smmu_domain now available in arm-smmu.h, arm-smmu-nvidia.c can
directly update the tlb_ops->tlb_sync and avoid indirection. Will update in
next version.

-KR
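
For reference, a rough userspace sketch of what checking the status of all
instances in the same polling pass could look like. This is not the driver
code: fake_inst, fake_read_status(), STATUS_ACTIVE, SPIN_LIMIT and
sync_all_instances() are hypothetical names made up for illustration, and
fake_read_status() only stands in for a readl_relaxed() on the per-instance
status register. The sketch also assumes a "sync still active" bit and ORs the
per-instance values, so the bit stays set until every instance has gone idle.

```c
#include <stdint.h>

#define MAX_INST      2      /* hypothetical: two SMMU instances */
#define STATUS_ACTIVE 0x1u   /* hypothetical "sync still in flight" bit */
#define SPIN_LIMIT    1000   /* hypothetical poll budget */

/* Simulated instance: number of reads that still report "active". */
struct fake_inst {
	unsigned int polls_until_idle;
};

/* Stand-in for readl_relaxed() on one instance's status register. */
static uint32_t fake_read_status(struct fake_inst *inst)
{
	if (inst->polls_until_idle == 0)
		return 0;
	inst->polls_until_idle--;
	return STATUS_ACTIVE;
}

/*
 * Poll every instance in each pass (instance loop innermost) instead of
 * waiting for instance 0 to finish before looking at instance 1.
 * Returns the number of passes taken, or -1 if the budget runs out.
 */
static int sync_all_instances(struct fake_inst *inst, unsigned int num_inst)
{
	for (int pass = 1; pass <= SPIN_LIMIT; pass++) {
		uint32_t val = 0;

		for (unsigned int i = 0; i < num_inst; i++)
			val |= fake_read_status(&inst[i]);

		if (!(val & STATUS_ACTIVE))
			return pass;
	}
	return -1;
}
```

With the instance loop innermost, the total wait is bounded by the slowest
instance rather than by the sum of the individual waits.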