[[ Adding Sebastian, who is quite experienced in intricate locking situations due to daily PREEMPT_RT work.. ]] On Wed, Mar 13, 2019 at 09:37:10AM +0100, Daniel Vetter wrote: > On Wed, Mar 13, 2019 at 08:49:17AM +0100, John Ogness wrote: > > On 2019-03-12, Ahmed S. Darwish <darwish.07@xxxxxxxxx> wrote: > > > On Wed, Mar 13, 2019 at 09:37:10AM +0100, Daniel Vetter wrote: [……] > > > > > > > > If we aim for perfect this should be a trylock still, maybe using > > > > our own device list. > > > > > > > > > > I definitely agree here. > > > > > > The lock may already be locked either by a stopped CPU, or by the > > > very same CPU we execute panic() on (e.g. NMI panic() on the > > > printing CPU). > > > > > > This is why it's very common for example in serial consoles, which > > > are usually careful about re-entrance and panic contexts, to do: > > > > > > xxxxxx_console_write(...) { > > > if (oops_in_progress) > > > locked = spin_trylock_irqsave(&port->lock, flags); > > > else > > > spin_lock_irqsave(&port->lock, flags); > > > } > > > > > > I'm quite positive we should do the same for panic drm drivers. > > > > This construction will continue, even if the trylock fails. It only > > makes sense to do this if the driver has a chance of being > > successful. Ignoring locks is a serious issue. I personally am quite > > unhappy that the serial drivers do this, which was part of my motivation > > for the new printk design I'm working on. > > > > If the driver is not capable of doing something useful on a failed > > trylock, then I recommend just skipping that device. Maybe trying it > > again later after trying all the devices? > > Ah yes missed that. If the trylock fails anywhere, we must bail out. > > Not sure retrying is useful, my experience from at least drm is that > either you're lucky, and drm wasn't doing anything right when the machine > blew up, and then the trylocks will all go through. Or you're unlucky, and > most likely that means drm itself blew up, and no amount of retrying is > going to help. I wouldn't bother. > Ack, retrying most probably won't help. I see that, at least in x86: kernel/panic.c::panic() => kernel/panic.c::smp_send_stop() => arch/x86/incluse/smp.h::smp_send_stop(0) => arch/x86/kernel/smp.c::native_stop_other_cpus(wait) /* * Don't wait longer than a second if the caller * didn't ask us to wait. */ timeout = USEC_PER_SEC; while (num_online_cpus() > 1 && (wait || timeout--)) udelay(1); So the panic()-ing core wait by default for _a full second_ until all other cores finish their non-preemptible regions. If that did not work out, it even tries more aggressively with NMIs. So if one of the other non panic()-ing cores was holding a DRM related lock through spin_lock_irqsave() for more than one second, well, it's a bug in the DRM layer code anyway.. [A] Problem is, what will happen if the non panic()-ing core was holding a DRM lock through barebone spin_lock()? Interrupts are enabled there, so the IPI will be handled, and thus that core will effectively die while the lock is forever held :-( [B] But, well, yeah, I don't think there's a solution for the [B] part except by spin_trylock() the fewest amount of locks possible, and bailing out if anyone of them is held. For reference, I asked over IRC why not just resetting GPU state, but it's not easy as it sounds to people outside of the gfx world: [abridged] <darwi> So it seems Windows is forcing driver writers to implement a gpu reset method [*] <darwi> since Windows vista <danvet> yeah <danvet> not going to do that from panic <danvet> there's a non-zero chance that gpu reset takes down the system with at least lots of hw/driver combos <danvet> I think all our drivers also have gpu reset support <danvet> if something is still rendering while we panic, mea culpa <danvet> only way around that is switching planes <danvet> which we tried with the old approach, and that's just too hard <danvet> you end up running a few 100kloc of code worst case <danvet> no way to audit/validate that for panic context <danvet> but the rendering shouldn't really matter <anholt> yeah, actually modesetting in a panic seems hopeless. <danvet> at least for modern compositors <danvet> anholt, yeah I killed that idea <danvet> so they only render to the backbuffer <danvet> and we won't show that with a panic <anholt> yep. just draw on a plane. for everything but uncomposited x11, it'll already be idle. <danvet> so only really an issue on boot splash and frontbuffer x11 <danvet> and there a few cursors/characters drawn won't really matter in most cases <danvet> anholt, yeah that's the idea, we pick the currently showing plane and have a simple draw_xy functions which knows how to draw bright/dark pixels for any drm_fourcc <danvet> so you can see the oops even if it's yuv or whatever <anholt> nice Thanks! [*] DXGKDDI_RESETFROMTIMEOUT callback function: https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/d3dkmddi/nc-d3dkmddi-dxgkddi_resetfromtimeout -- darwi http://darwish.chasingpointers.com _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx