On Mon, Mar 18, 2019 at 12:06:13AM +0100, Ahmed S. Darwish wrote: > > => Now that the dust has settled, here's a summary of this huge > 50-email thread (thanks Daniel, Noralf, John, everyone!). > > => Parts of this document are a direct rewording of Daniel's replies, > so I took the liberty of adding a Co-developed-by tag here.. > > => This is only a summary, and _not_ an official patch submission. > It's now Show-me-the-code time ;-) > > Subject: [PATCH] Documentation: gpu: Add initial DRM panic design > > Co-developed-by: Daniel Vetter <daniel@xxxxxxxx> > Signed-off-by: Ahmed S. Darwish <darwish.07@xxxxxxxxx> > --- > Documentation/gpu/drm-panic-design.rst | 124 +++++++++++++++++++++++++ > 1 file changed, 124 insertions(+) > create mode 100644 Documentation/gpu/drm-panic-design.rst > > diff --git a/Documentation/gpu/drm-panic-design.rst b/Documentation/gpu/drm-panic-design.rst > new file mode 100644 > index 000000000000..ba306193f347 > --- /dev/null > +++ b/Documentation/gpu/drm-panic-design.rst > @@ -0,0 +1,124 @@ > + > +======================== > +A DRM-based panic viewer > +======================== > + > +The Linux Kernel currently contains the necessary plumbing for viewing > +a kernel panic log using DRM-based kmsg dumpers, even if the system is > +currently running a graphical session (e.g. wayland). > + > +.. _drm_panic_design: > + > +Implementation design > +===================== > + > +Code pathes running in a panic context face several constraints: > + > +1. Code must be fully synchronous: interrupts are disabled > +2. Dynamic memory allocations are not allowed > +3. Cannot acquire any mutexes: atomic context, due to 1. > +4. Cannot acquire any spin locks: potential spinning-forever risk Maybe rephrase as: 3. No sleeping (there's other ways to sleep than just memory allocations and acquiring a mutex) 4. No unconditional locking at all (there's more than spinlocks/mutexes). E.g. one of the most important locks is drm_modest_lock, which is a ww_mutex. > + > +For the *DRM* panic code, the extra conditions below apply: > + > +5. Code must only trylock relevant DRM subsystem locks > +6. If any trylock operation fails, the code *must not* go through > +7. All rendering is done on the GPU's "current display buffer" used > + at the time of panic(): no HW programming is done at all. Maybe let's clarify this a bit, since from the discussions it sounded like at least amdgpu needs to touch a few bits (disable tiling, the indirect vram write registers to access vram outside of the bar): "Only the least amount of HW programming (preferrably none) is done, exceptions would be disabling tiled/compressed scanout." > +8. The code must be non-intrusive, and *must not* affect any other > + panic handling mechanism (netconsole, ramoops, kexec, etc.) Hm, I'm not entirely clear on what you mean here. Maybe some more examples of what would be a bad idea? > + > +Rationale follows. > + > + > +Spin locks > +---------- > + > +Acquiring a spin lock in a panic() context is potentially lethal: > +the lock might've been already acquired, _permanently_, by another > +core that is now fully shut down through an IPI from the panic()-ing > +core. > + > +Moreover, at least on x86, the first target architecture for this > +work, the panic()-ing core wait by default for *a full second* until > +all other cores finish their non-preemptible regions and terminate. > +If that did not work out, it even tries more aggressively with NMIs. > + > +So if the other non panic()-ing cores was holding a DRM-related lock > +through spin_lock_irqsave() for more than one second, then it's a > +bug in the DRM layer code. Thus, the DRM panic viewer cannot do > +anything and should just exit. [A] > + > +What if the non panic()-ing core was holding a DRM lock through > +barebone spin_lock()? Interrupts are enabled there, so the IPI will be > +handled, and thus that core will effectively die while the lock is > +*forever held*. [B] > + > + > +Trylocking > +---------- > + > +The DRM panic code always *tries* to acquire the *minimum relevant > +set* of DRM related locks, through the basic :c:func:`spin_trylock()` > +mechanism. > + > +From case [A] and case [B] above, if the trylock operation fails, > +there's no point in retrying it multiple times: the relevant locks > +are in a broken and unrecoverable state anyway. > + > +Moreover, The panic code cannot also just ignore the DRM locks and > +force its way through: a broken non-preemptible DRM region implies > +either invalid SW state (e.g. broken linked list pointers), or a GPU > +in an unknown HW state. > + > +A GPU in an unknown HW state is quite dangerous: it has access to the > +full system memory, and if poked incorrectly, there's a really good > +chance it can kill the entire machine. > + > + > +GPU hardware access > +------------------- > + > +In the current state, a full GPU reset, modesetting, or even disabling > +GPU planes, is not doable under a panic() context: it implies going > +through a potentially huge set of DRM call-chains that cannot be > +sanely verified against the :ref:`drm_panic_design` requirements > +(e.g. memory allocations, spinlocks, etc.). > + > +The current approach is simple: run the minimal amount of code > +necessary to draw pixels onto the current scanout buffers. Instead > +of disabling GPU planes, the biggest visible rectangle is just picked. > + > +*Usually* there should be a main window that is big enough to show the > +oops. > + > + > +CI testing > +---------- > + > +One of the things that killed earlier linux DRM panic handling efforts, > +beside getting into deep DRM call-chains that cannot be verified, was > +that it couldn't be tested except with real oopses. > + > +The first set of bug reports was whack-a-molde kind of bugs where the > +oops displayed was caused by the DRM panic handler itself instead of > +the real oops causing the panic. > + > +Thus, the :ref:`drm_panic_design` requirements was created. Moreover: > + > + - Special hooks are added at the spin_lock() level to complain > + loudly if a spin lock operation was tried under the DRM panic > + context. This could be easily noticed/reported by CI testing. > + > + - *Non-destructive* testing of the DRM panic code, emulating a > + real panic path context as much as possible (e.g. by disabling > + irqs and enabling the spin lock hooks earlier mentioned), is > + created. This is necessary for heaviling testing the DRM panic > + code through `CI machines <https://lwn.net/Articles/735468/>`_. > + > + > +Supported drivers > +================= > + > +* Intel i915-compatible cards Excellent summary, thanks for typing this up. -Daniel > > > -- > darwi > http://darwish.chasingpointers.com -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx