drm/msm uses the stall-on-fault model to record the GPU state on the
first GPU page fault to help debugging. On systems where the GPU is
paired with an MMU-500, there were two problems:

1. The MMU-500 doesn't de-assert its interrupt line until the fault is
   resumed, which led to a storm of interrupts until the fault handler
   was called. If we got unlucky and the fault handler was on the same
   CPU as the interrupt, there was a deadlock.
2. The GPU is capable of generating page faults much faster than we can
   resume them. The GMU (GPU Management Unit) shares the same context
   bank as the GPU, so if there was a sudden spurt of page faults it
   would be effectively starved and would trigger a watchdog reset.
   This was made even worse because the GPU cannot be reset while
   there's a pending transaction, leaving the GPU permanently wedged.

Patch 1 fixes the first problem and is independent of the rest of the
series. Patch 3 fixes the second problem and is dependent on patch 2, so
there will have to be some cross-tree coordination.

I've rebased this series on the latest linux-next to avoid rebase
troubles.

Signed-off-by: Connor Abbott <cwabbott0@xxxxxxxxx>
---
Changes in v3:
- Acknowledge the fault before resuming the transaction in patch 1.
- Add suggested extra context to commit messages.
- Link to v2: https://lore.kernel.org/r/20250120-msm-gpu-fault-fixes-next-v2-0-d636c4027042@xxxxxxxxx

Changes in v2:
- Remove unnecessary _irqsave when locking in IRQ handler (Robin)
- Reuse existing spinlock for CFIE manipulation (Robin)
- Lock CFCFG manipulation against concurrent CFIE manipulation
- Don't use timer to re-enable stall-on-fault. (Rob)
- Use more descriptive name for the function that re-enables
  stall-on-fault if the cooldown period has ended. (Rob)
- Link to v1: https://lore.kernel.org/r/20250117-msm-gpu-fault-fixes-next-v1-0-bc9b332b5d0b@xxxxxxxxx

---
Connor Abbott (3):
      iommu/arm-smmu: Fix spurious interrupts with stall-on-fault
      iommu/arm-smmu-qcom: Make set_stall work when the device is on
      drm/msm: Temporarily disable stall-on-fault after a page fault

 drivers/gpu/drm/msm/adreno/a5xx_gpu.c      |  2 ++
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c      |  4 +++
 drivers/gpu/drm/msm/adreno/adreno_gpu.c    | 42 +++++++++++++++++++++++++++-
 drivers/gpu/drm/msm/adreno/adreno_gpu.h    | 24 ++++++++++++++++
 drivers/gpu/drm/msm/msm_iommu.c            |  9 ++++++
 drivers/gpu/drm/msm/msm_mmu.h              |  1 +
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 45 +++++++++++++++++++++++++++---
 drivers/iommu/arm/arm-smmu/arm-smmu.c      | 41 ++++++++++++++++++++++++++-
 drivers/iommu/arm/arm-smmu/arm-smmu.h      |  1 -
 9 files changed, 162 insertions(+), 7 deletions(-)
---
base-commit: 0907e7fb35756464aa34c35d6abb02998418164b
change-id: 20250117-msm-gpu-fault-fixes-next-96e3098023e1

Best regards,
-- 
Connor Abbott <cwabbott0@xxxxxxxxx>
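
For readers who want a mental model of the cooldown in patch 3, here is
a rough userspace sketch, not the driver code. It assumes stall-on-fault
is turned off on a fault and turned back on lazily, without a timer,
once a cooldown has elapsed. The struct, the function names, the 500 ms
value, and the choice to restart the cooldown on every fault are all
hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cooldown length; the real value is up to the driver. */
#define STALL_COOLDOWN_MS 500u

/* Hypothetical per-GPU state for this sketch. */
struct stall_state {
	bool stall_enabled;      /* is stall-on-fault currently on? */
	uint64_t reenable_at_ms; /* earliest time it may be turned back on */
};

static void stall_state_init(struct stall_state *s)
{
	assert(s != NULL);
	s->stall_enabled = true;
	s->reenable_at_ms = 0;
}

/*
 * Model of the fault path. Returns true if this fault arrived while
 * stalling was enabled, i.e. it is the fault whose GPU state would be
 * recorded. Every fault disables stalling and (an assumption of this
 * sketch) restarts the cooldown.
 */
static bool stall_on_fault(struct stall_state *s, uint64_t now_ms)
{
	bool was_stalled;

	assert(s != NULL);
	was_stalled = s->stall_enabled;
	s->stall_enabled = false;
	s->reenable_at_ms = now_ms + STALL_COOLDOWN_MS;
	return was_stalled;
}

/*
 * Model of the lazy re-enable: called from some convenient point
 * instead of from a timer. Only flips the flag once the cooldown
 * period has ended.
 */
static void stall_reenable_if_cooled_down(struct stall_state *s,
					  uint64_t now_ms)
{
	assert(s != NULL);
	if (!s->stall_enabled && now_ms >= s->reenable_at_ms)
		s->stall_enabled = true;
}
```

In this model a burst of faults only stalls on the first one, so the
later faults in the burst never leave a transaction pending, and
stalling comes back only after the faults have stopped for the whole
cooldown period.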