Hey,

Sorry for the wide email, but I figured someone who recently contributed to or maintains the Qualcomm SMMU driver may have some proper insight into this.

Recently I remembered that performance on some Qualcomm platforms takes a major hit when you use iommu.strict=1 / CONFIG_IOMMU_DEFAULT_DMA_STRICT.

On the sa8775p-ride, I see most TLB sync calls taking about 150 us, with some spiking to 500 us, etc.:

    [root@qti-snapdragon-ride4-sa8775p-09 ~]# trace-cmd start -p function_graph -g qcom_smmu_tlb_sync --max-graph-depth 1
      plugin 'function_graph'
    [root@qti-snapdragon-ride4-sa8775p-09 ~]# trace-cmd show
    # tracer: function_graph
    #
    # CPU  DURATION                  FUNCTION CALLS
    # |     |   |                     |   |   |   |
     0) ! 144.062 us  |  qcom_smmu_tlb_sync();

On my sc8280xp-lenovo-thinkpad-x13s (the only other Qualcomm platform I can compare with) I see around 2-15 us, with spikes up to 20-30 us. That's thanks to this patch[0], which I guess improved the platform from 1-2 ms to the ~10 us number.

It's not entirely clear to me how DPU-specific programming affects system-wide SMMU performance, but I'm curious whether this is the only way to achieve this. sa8775p doesn't even have the DPU described right now, so that's a bummer, as there's no way to make a similar immediate optimization. I'm still struggling to understand what that patch really did to improve things, so maybe I'm missing something.

I'm honestly not even sure what a "typical" range for TLB sync time would be, but on sa8775p-ride it's bad enough that some IRQs, like UFS, can cause RCU stalls (pretty easy to reproduce on the platform with fio's basic-verify.fio, for example). It also makes running with iommu.strict=1 impractical, as performance for UFS, ethernet, etc. drops 75-80%.

Does anyone have any bright ideas on how to improve this, or on whether I'm even right to assume that this time is suspiciously long?

Thanks,
Andrew

[0] https://lore.kernel.org/linux-arm-msm/CAF6AEGs9PLiCZdJ-g42-bE6f9yMR6cMyKRdWOY5m799vF9o4SQ@xxxxxxxxxxxxxx/
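
For anyone who wants to compare numbers from another platform: the per-call durations in the function_graph output can be summarized with a few lines of Python. This is only a minimal sketch, not part of any existing tooling. It assumes the text from `trace-cmd show` has been saved or piped in as-is, that each call appears on one line in the `CPU) [marker] DURATION us | func();` form shown in the trace, and that the optional marker is one of the tracer's duration flags (`+ ! # * @ $`). The helper names `parse_durations` and `summarize` are made up for illustration.

```python
import re
import statistics

# Matches single-line function_graph entries such as:
#    0) ! 144.062 us  |  qcom_smmu_tlb_sync();
# The optional character before the duration is assumed to be one of the
# tracer's duration flags; header/comment lines simply fail to match.
LINE_RE = re.compile(
    r"^\s*\d+\)\s*[+!#*@$]?\s*([\d.]+)\s+us\s*\|\s*(\w+)\(\);"
)


def parse_durations(text, func="qcom_smmu_tlb_sync"):
    """Return the durations (in microseconds) of every call to `func`."""
    durations = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m and m.group(2) == func:
            durations.append(float(m.group(1)))
    return durations


def summarize(durations):
    """Return count/min/median/max in microseconds, or None if empty."""
    if not durations:
        return None
    ordered = sorted(durations)
    return {
        "count": len(ordered),
        "min_us": ordered[0],
        "median_us": statistics.median(ordered),
        "max_us": ordered[-1],
    }


if __name__ == "__main__":
    import sys

    # Example: trace-cmd show | python3 tlb_sync_stats.py
    print(summarize(parse_durations(sys.stdin.read())))
```

Comparing the median and max between sa8775p-ride and the X13s this way would at least put the "typical versus spike" numbers on the same footing.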