Re: [PATCH v7] drm/doc: Document DRM device reset expectations

André Almeida <andrealmeid@xxxxxxxxxx> · Wed, 23 Aug 2023 15:07:21 -0300

Hi Rodrigo,

On 23/08/2023 14:31, Rodrigo Vivi wrote:
On Fri, Aug 18, 2023 at 05:06:42PM -0300, André Almeida wrote:
Create a section that specifies how to deal with DRM device resets for
kernel and userspace drivers.

Signed-off-by: André Almeida <andrealmeid@xxxxxxxxxx>

---

v7 changes:
  - s/application/graphical API contex/ in the robustness part (Michel)
  - Grammar fixes (Randy)

v6: https://lore.kernel.org/lkml/20230815185710.159779-1-andrealmeid@xxxxxxxxxx/

v6 changes:
  - Due to substantial changes in the content, dropped Pekka's Acked-by
  - Grammar fixes (Randy)
  - Add paragraph about disabling device resets
  - Add note about integrating reset tracking in drm/sched
  - Add note that KMD should return failure for contexts affected by
    resets and UMD should check for this
  - Add note about lack of consensus around what to do about non-robust
    apps

v5: https://lore.kernel.org/dri-devel/20230627132323.115440-1-andrealmeid@xxxxxxxxxx/
---
  Documentation/gpu/drm-uapi.rst | 77 ++++++++++++++++++++++++++++++++++
  1 file changed, 77 insertions(+)

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 65fb3036a580..3694bdb977f5 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -285,6 +285,83 @@ for GPU1 and GPU2 from different vendors, and a third handler for
  mmapped regular files. Threads cause additional pain with signal
  handling as well.
  
+Device reset
+============
+
+The GPU stack is really complex and is prone to errors, from hardware bugs,
+faulty applications and everything in between the many layers. Some errors
+require resetting the device in order to make the device usable again. This
+section describes the expectations for DRM and usermode drivers when a
+device resets and how to propagate the reset status.
+
+Device resets can not be disabled without tainting the kernel, since disabling
+them can hang the entire kernel through shrinkers/mmu_notifiers. The userspace
+role in device resets is to propagate the message to the application and apply
+any special policy for blocking guilty applications, if any. A corollary is that
+debugging a hung GPU context requires hardware support to be able to preempt
+such a GPU context while it's stopped.
+
+Kernel Mode Driver
+------------------
+
+The KMD is responsible for checking if the device needs a reset, and for
+performing it as needed. Usually a hang is detected when a job gets stuck
+executing. KMD should keep track of resets, because userspace can query the
+reset status for a specific context at any time. This is needed to propagate to
+the rest of the stack that a reset has happened. Currently, this is implemented
+by each driver separately, with no common DRM interface. Ideally this should be
+integrated into the DRM scheduler to provide a common ground for all drivers.
+After a reset, KMD should reject new command submissions for affected contexts.

is there any consensus around what exactly 'affected contexts' might mean?
I see i915 pin-point only the context that was at execution with head pointing
at it and doesn't blame the queued ones, while on Xe it looks like we are
blaming all the queued context. Not sure what other drivers are doing for the
'affected contexts'.


"Affected contexts" is indeed a generic term, given the differences
between drivers that you already pointed out. amdgpu also tends to affect
all queued contexts during a reset. This wording was chosen to fit how
the different drivers work.
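
To make the two readings of "affected contexts" concrete, here is a minimal
userspace C sketch. It is not code from any driver: struct gpu_ctx, enum
blame_policy, handle_reset() and submit() are hypothetical names, and
returning -EIO for a rejected submission is an assumption made only for
illustration. It models blaming just the context that was executing versus
blaming every queued context, plus the per-context reset tracking and
submission rejection that the patch text asks of the KMD.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; none of these names come from a real driver. */
enum blame_policy {
	BLAME_EXECUTING_ONLY,	/* only the context on the hardware at hang time */
	BLAME_ALL_QUEUED,	/* every context with jobs queued on the engine */
};

struct gpu_ctx {
	bool executing;			/* was running when the hang was detected */
	bool queued;			/* had jobs pending on the engine being reset */
	unsigned int reset_count;	/* resets that affected this context */
	bool banned;			/* new submissions are rejected once set */
};

/* Decide whether a context counts as "affected" under the given policy. */
static bool ctx_is_affected(const struct gpu_ctx *ctx, enum blame_policy policy)
{
	if (ctx->executing)
		return true;
	return policy == BLAME_ALL_QUEUED && ctx->queued;
}

/* Record the reset on every affected context so userspace can query it. */
static void handle_reset(struct gpu_ctx *ctxs, size_t n, enum blame_policy policy)
{
	assert(ctxs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (!ctx_is_affected(&ctxs[i], policy))
			continue;
		ctxs[i].reset_count++;
		ctxs[i].banned = true;
	}
}

/* Reject new work from affected contexts; -EIO is an assumed error code. */
static int submit(const struct gpu_ctx *ctx)
{
	return ctx->banned ? -EIO : 0;
}
```

With one executing context and one merely queued context, the first policy
makes only the executing context fail in submit(), while the second makes
both fail. An idle context is unaffected in either case.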