Re: [PATCH v4 00/11] Preemption support for A7XX

Neil Armstrong <neil.armstrong@xxxxxxxxxx> · Wed, 18 Sep 2024 09:46:33 +0200

Hi,

On 17/09/2024 13:14, Antonino Maniscalco wrote:
This series implements preemption for A7XX targets, which allows the GPU to
switch to an higher priority ring when work is pushed to it, reducing latency
for high priority submissions.

This series enables L1 preemption with skip_save_restore which requires
the following userspace patches to function:

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30544

A flag is added to `msm_submitqueue_create` to only allow submissions
from compatible userspace to be preempted, therefore maintaining
compatibility.

Preemption is currently only enabled by default on A750, it can be
enabled on other targets through the `enable_preemption` module
parameter. This is because more testing is required on other targets.

For testing on other HW it is sufficient to set that parameter to a
value of 1, then using the branch of mesa linked above, `TU_DEBUG=hiprio`
allows to run any application as high priority therefore preempting
submissions from other applications.

The `msm_gpu_preemption_trigger` and `msm_gpu_preemption_irq` traces
added in this series can be used to observe preemption's behavior as
well as measuring preemption latency.

Some commits from this series are based on a previous series to enable
preemption on A6XX targets:

https://lkml.kernel.org/1520489185-21828-1-git-send-email-smasetty@xxxxxxxxxxxxxx

Signed-off-by: Antonino Maniscalco <antomani103@xxxxxxxxx>
---
Changes in v4:
- Added missing register in pwrup list
- Removed and rearrange barriers
- Renamed `skip_inline_wptr` to `restore_wptr`
- Track ctx seqno per ring
- Removed secure preempt context
- NOP out postamble to disable it instantly
- Only emit pwrup reglist once
- Document bv_rptr_addr
- Removed unused A6XX_PREEMPT_USER_RECORD_SIZE
- Set name on preempt record buffer
- Link to v3: https://lore.kernel.org/r/20240905-preemption-a750-t-v3-0-fd947699f7bc@xxxxxxxxx

Changes in v3:
- Added documentation about preemption
- Use quirks to determine which target supports preemption
- Add a module parameter to force disabling or enabling preemption
- Clear postamble when profiling
- Define A6XX_CP_CONTEXT_SWITCH_CNTL_LEVEL fields in a6xx.xml
- Make preemption records MAP_PRIV
- Removed user ctx record (NON_PRIV) and patch 2/9 as it's not needed
   anymore
- Link to v2: https://lore.kernel.org/r/20240830-preemption-a750-t-v2-0-86aeead2cd80@xxxxxxxxx

Changes in v2:
- Added preept_record_size for X185 in PATCH 3/7
- Added patches to reset perf counters
- Dropped unused defines
- Dropped unused variable (fixes warning)
- Only enable preemption on a750
- Reject MSM_SUBMITQUEUE_ALLOW_PREEMPT for unsupported targets
- Added Akhil's Reviewed-By tags to patches 1/9,2/9,3/9
- Added Neil's Tested-By tags
- Added explanation for UAPI changes in commit message
- Link to v1: https://lore.kernel.org/r/20240815-preemption-a750-t-v1-0-7bda26c34037@xxxxxxxxx

---
Antonino Maniscalco (11):
       drm/msm: Fix bv_fence being used as bv_rptr
       drm/msm/A6XX: Track current_ctx_seqno per ring
       drm/msm: Add a `preempt_record_size` field
       drm/msm: Add CONTEXT_SWITCH_CNTL bitfields
       drm/msm/A6xx: Implement preemption for A7XX targets
       drm/msm/A6xx: Sync relevant adreno_pm4.xml changes
       drm/msm/A6xx: Use posamble to reset counters on preemption
       drm/msm/A6xx: Add traces for preemption
       drm/msm/A6XX: Add a flag to allow preemption to submitqueue_create
       drm/msm/A6xx: Enable preemption for A750
       Documentation: document adreno preemption

  Documentation/gpu/msm-preemption.rst               |  98 +++++
  drivers/gpu/drm/msm/Makefile                       |   1 +
  drivers/gpu/drm/msm/adreno/a2xx_gpu.c              |   2 +-
  drivers/gpu/drm/msm/adreno/a3xx_gpu.c              |   2 +-
  drivers/gpu/drm/msm/adreno/a4xx_gpu.c              |   2 +-
  drivers/gpu/drm/msm/adreno/a5xx_gpu.c              |   6 +-
  drivers/gpu/drm/msm/adreno/a6xx_catalog.c          |   7 +-
  drivers/gpu/drm/msm/adreno/a6xx_gpu.c              | 325 ++++++++++++++-
  drivers/gpu/drm/msm/adreno/a6xx_gpu.h              | 174 ++++++++
  drivers/gpu/drm/msm/adreno/a6xx_preempt.c          | 440 +++++++++++++++++++++
  drivers/gpu/drm/msm/adreno/adreno_gpu.h            |   9 +-
  drivers/gpu/drm/msm/msm_drv.c                      |   4 +
  drivers/gpu/drm/msm/msm_gpu.c                      |   2 +-
  drivers/gpu/drm/msm/msm_gpu.h                      |  11 -
  drivers/gpu/drm/msm/msm_gpu_trace.h                |  28 ++
  drivers/gpu/drm/msm/msm_ringbuffer.h               |  18 +
  drivers/gpu/drm/msm/msm_submitqueue.c              |   3 +
  drivers/gpu/drm/msm/registers/adreno/a6xx.xml      |   7 +-
  .../gpu/drm/msm/registers/adreno/adreno_pm4.xml    |  39 +-
  include/uapi/drm/msm_drm.h                         |   5 +-
  20 files changed, 1117 insertions(+), 66 deletions(-)
---
base-commit: 7c626ce4bae1ac14f60076d00eafe71af30450ba
change-id: 20240815-preemption-a750-t-fcee9a844b39

Best regards,

I've been running vulkan-cts (1.3.7.3-0-gd71a36db16d98313c431829432a136dbda692a08 from Yocto)
on SM8650-QRD, SM8550-QRD & SM8450-HDK boards with enable_preemption in default value
and forced to 1, and I've seen no regression so far

On SM8550, I've seen a few:
platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg.constprop.0 [msm]] *ERROR* Message HFI_H2F_MSG_GX_BW_PERF_VOTE id 2743 timed out waiting for response
platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg.constprop.0 [msm]] *ERROR* Unexpected message id 2743 on the response queue
but it's unrelated to preempt

and on SM8450:
platform 3d6a000.gmu: [drm:a6xx_gmu_set_oob [msm]] *ERROR* Timeout waiting for GMU OOB set GPU_SET: 0x0
msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 7.3.0.1: hangcheck detected gpu lockup rb 0!
msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 7.3.0.1:     completed fence: 331235
msm_dpu ae01000.display-controller: [drm:hangcheck_handler [msm]] *ERROR* 7.3.0.1:     submitted fence: 331236
adreno 3d00000.gpu: [drm:a6xx_irq [msm]] *ERROR* gpu fault ring 0 fence 50de4 status 00800005 rb 0000/0699 ib1 0000000000000000/0000 ib2 0000000000000000/0000
msm_dpu ae01000.display-controller: [drm:recover_worker [msm]] *ERROR* 7.3.0.1: hangcheck recover!
msm_dpu ae01000.display-controller: [drm:recover_worker [msm]] *ERROR* 7.3.0.1: offending task: deqp-vk (/usr/lib/vulkan-cts/deqp-vk)
msm_dpu ae01000.display-controller: [drm:recover_worker [msm]] *ERROR* 7.3.0.1: hangcheck recover!
leading to a VK_ERROR_DEVICE_LOST, but again unrelated to preempt support.

So you can also add:
Tested-by: Neil Armstrong <neil.armstrong@xxxxxxxxxx> # on SM8550-QRD
Tested-by: Neil Armstrong <neil.armstrong@xxxxxxxxxx> # on SM8450-HDK

Thanks,
Neil