Hi, On Wed, Mar 13, 2019 at 09:35:05AM +0100, Daniel Vetter wrote: > On Tue, Mar 12, 2019 at 11:13:03PM +0100, Ahmed S. Darwish wrote: > > On Mon, Mar 11, 2019 at 11:33:15PM +0100, Noralf Trønnes wrote: > > > Den 11.03.2019 20.23, skrev Daniel Vetter: [……] > > > > > > > > class_for_each_device uses klist, which only uses an irqsave spinlock. I > > > > think that's good enough. Comment to that effect would be good e.g. > > > > > > > > /* based on klist, which uses only a spin_lock_irqsave, which we > > > > * assume still works */ > > > > > > > > If we aim for perfect this should be a trylock still, maybe using our own > > > > device list. > > > > > > > > I definitely agree here. > > > > The lock may already be locked either by a stopped CPU, or by the > > very same CPU we execute panic() on (e.g. NMI panic() on the > > printing CPU). > > > > This is why it's very common for example in serial consoles, which > > are usually careful about re-entrance and panic contexts, to do: > > > > xxxxxx_console_write(...) { > > if (oops_in_progress) > > locked = spin_trylock_irqsave(&port->lock, flags); > > else > > spin_lock_irqsave(&port->lock, flags); > > } > > > > I'm quite positive we should do the same for panic drm drivers. > > Yeah Ideally all the locking in the drm path would be trylock only. > > I wonder whether lockdep could help us validate this, with some "don't > allow anything except trylocks in this context". It's easy to audit the > core code with review, but drivers are much tougher. And often end up with > really deep callchains to get at the backing buffers. [……] > > > > Instead what I had in mind is to recreate the worst > > > > possible panic context as much as feasible (disabling interrupts should be > > > > a good start, maybe we can even do an nmi callback), and then call our > > > > panic implementation. That way we can test the panic handler in a > > > > non-destructive way (i.e. aside from last dmesg lines printed to the > > > > screen nothing bad happens to the kernel: No real panic, no oops, no > > > > tainting). > > > > > > The interrupt case I can do, nmi I have no idea. > > > > > > > I agree too. Disabling interrupts + CONFIG_DEBUG_ATOMIC_SLEEP > > would be a nice non-destructive test-case emulation. > > See above, if we can somehow emulate "all locks are held, only allow > trylock" with lockdep that would be great too. > Hmmmm, to prototype things faster, I'll do it as a hook: spin_lock(spinlock_t *lock) { might_spin_forever_under_panic_path(); ... } Just like how mutexes and DEBUG_ATOMIC_SLEEP do it today: void mutex_lock(struct mutex *lock) { might_sleep(); ... } Hopefully the locking maintainers can accept something like this. If not, let's play with lockdep (which is a bit complex). ===> Thanks to Sebastian for suggesting this! > Plus nmi context, once/if that somehow becomes relevant. > > The thing that killed the old drm panic handling code definitely was that > we flat out couldnt' test it except with real oopses. And that's just > whack-a-mole and bug reporter frustration if you first have a few patch > iterations around "oh sry, it's still some oops in the panic handler, not > the first one that we're seeing" :-/ > Oh, that's a critical point: full data > some data > no data > invalid/useless data First impressions for the feature will definitely count. Hopefully the hooks above and some heavy DRM CI testing __beforehand__ can improve things before publishing to a wider audience.. Cheers, -- darwi http://darwish.chasingpointers.com _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx