=> Now that the dust has settled, here's a summary of this huge 50-email thread (thanks Daniel, Noralf, John, everyone!). => Parts of this document are a direct rewording of Daniel's replies, so I took the liberty of adding a Co-developed-by tag here.. => This is only a summary, and _not_ an official patch submission. It's now Show-me-the-code time ;-) Subject: [PATCH] Documentation: gpu: Add initial DRM panic design Co-developed-by: Daniel Vetter <daniel@xxxxxxxx> Signed-off-by: Ahmed S. Darwish <darwish.07@xxxxxxxxx> --- Documentation/gpu/drm-panic-design.rst | 124 +++++++++++++++++++++++++ 1 file changed, 124 insertions(+) create mode 100644 Documentation/gpu/drm-panic-design.rst diff --git a/Documentation/gpu/drm-panic-design.rst b/Documentation/gpu/drm-panic-design.rst new file mode 100644 index 000000000000..ba306193f347 --- /dev/null +++ b/Documentation/gpu/drm-panic-design.rst @@ -0,0 +1,124 @@ + +======================== +A DRM-based panic viewer +======================== + +The Linux Kernel currently contains the necessary plumbing for viewing +a kernel panic log using DRM-based kmsg dumpers, even if the system is +currently running a graphical session (e.g. wayland). + +.. _drm_panic_design: + +Implementation design +===================== + +Code pathes running in a panic context face several constraints: + +1. Code must be fully synchronous: interrupts are disabled +2. Dynamic memory allocations are not allowed +3. Cannot acquire any mutexes: atomic context, due to 1. +4. Cannot acquire any spin locks: potential spinning-forever risk + +For the *DRM* panic code, the extra conditions below apply: + +5. Code must only trylock relevant DRM subsystem locks +6. If any trylock operation fails, the code *must not* go through +7. All rendering is done on the GPU's "current display buffer" used + at the time of panic(): no HW programming is done at all. +8. The code must be non-intrusive, and *must not* affect any other + panic handling mechanism (netconsole, ramoops, kexec, etc.) + +Rationale follows. + + +Spin locks +---------- + +Acquiring a spin lock in a panic() context is potentially lethal: +the lock might've been already acquired, _permanently_, by another +core that is now fully shut down through an IPI from the panic()-ing +core. + +Moreover, at least on x86, the first target architecture for this +work, the panic()-ing core wait by default for *a full second* until +all other cores finish their non-preemptible regions and terminate. +If that did not work out, it even tries more aggressively with NMIs. + +So if the other non panic()-ing cores was holding a DRM-related lock +through spin_lock_irqsave() for more than one second, then it's a +bug in the DRM layer code. Thus, the DRM panic viewer cannot do +anything and should just exit. [A] + +What if the non panic()-ing core was holding a DRM lock through +barebone spin_lock()? Interrupts are enabled there, so the IPI will be +handled, and thus that core will effectively die while the lock is +*forever held*. [B] + + +Trylocking +---------- + +The DRM panic code always *tries* to acquire the *minimum relevant +set* of DRM related locks, through the basic :c:func:`spin_trylock()` +mechanism. + +From case [A] and case [B] above, if the trylock operation fails, +there's no point in retrying it multiple times: the relevant locks +are in a broken and unrecoverable state anyway. + +Moreover, The panic code cannot also just ignore the DRM locks and +force its way through: a broken non-preemptible DRM region implies +either invalid SW state (e.g. broken linked list pointers), or a GPU +in an unknown HW state. + +A GPU in an unknown HW state is quite dangerous: it has access to the +full system memory, and if poked incorrectly, there's a really good +chance it can kill the entire machine. + + +GPU hardware access +------------------- + +In the current state, a full GPU reset, modesetting, or even disabling +GPU planes, is not doable under a panic() context: it implies going +through a potentially huge set of DRM call-chains that cannot be +sanely verified against the :ref:`drm_panic_design` requirements +(e.g. memory allocations, spinlocks, etc.). + +The current approach is simple: run the minimal amount of code +necessary to draw pixels onto the current scanout buffers. Instead +of disabling GPU planes, the biggest visible rectangle is just picked. + +*Usually* there should be a main window that is big enough to show the +oops. + + +CI testing +---------- + +One of the things that killed earlier linux DRM panic handling efforts, +beside getting into deep DRM call-chains that cannot be verified, was +that it couldn't be tested except with real oopses. + +The first set of bug reports was whack-a-molde kind of bugs where the +oops displayed was caused by the DRM panic handler itself instead of +the real oops causing the panic. + +Thus, the :ref:`drm_panic_design` requirements was created. Moreover: + + - Special hooks are added at the spin_lock() level to complain + loudly if a spin lock operation was tried under the DRM panic + context. This could be easily noticed/reported by CI testing. + + - *Non-destructive* testing of the DRM panic code, emulating a + real panic path context as much as possible (e.g. by disabling + irqs and enabling the spin lock hooks earlier mentioned), is + created. This is necessary for heaviling testing the DRM panic + code through `CI machines <https://lwn.net/Articles/735468/>`_. + + +Supported drivers +================= + +* Intel i915-compatible cards -- darwi http://darwish.chasingpointers.com _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx