On Wed, Jan 17, 2024 at 03:28:09PM +0100, Jocelyn Falempe wrote:
> 
> 
> On 12/01/2024 14:41, Daniel Vetter wrote:
> > On Thu, Jan 04, 2024 at 05:00:49PM +0100, Jocelyn Falempe wrote:
> > > This was initialy done for imx6, but should work on most drivers
> > > using drm_fb_dma_helper.
> > > 
> > > Signed-off-by: Jocelyn Falempe <jfalempe@xxxxxxxxxx>
> > > ---
> > >  drivers/gpu/drm/drm_fb_dma_helper.c | 55 +++++++++++++++++++++++++++++
> > >  include/drm/drm_fb_dma_helper.h     |  4 +++
> > >  2 files changed, 59 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/drm_fb_dma_helper.c b/drivers/gpu/drm/drm_fb_dma_helper.c
> > > index 3b535ad1b07c..caed2935df4f 100644
> > > --- a/drivers/gpu/drm/drm_fb_dma_helper.c
> > > +++ b/drivers/gpu/drm/drm_fb_dma_helper.c
> > > @@ -15,6 +15,7 @@
> > >  #include <drm/drm_framebuffer.h>
> > >  #include <drm/drm_gem_dma_helper.h>
> > >  #include <drm/drm_gem_framebuffer_helper.h>
> > > +#include <drm/drm_panic.h>
> > >  #include <drm/drm_plane.h>
> > >  #include <linux/dma-mapping.h>
> > >  #include <linux/module.h>
> > > @@ -148,3 +149,57 @@ void drm_fb_dma_sync_non_coherent(struct drm_device *drm,
> > >  	}
> > >  }
> > >  EXPORT_SYMBOL_GPL(drm_fb_dma_sync_non_coherent);
> > > +
> > > +#if defined(CONFIG_DRM_PANIC)
> > > +/**
> > > + * @dev: DRM device
> > > + * @drm_scanout_buffer: scanout buffer for the panic handler
> > > + * Returns: 0 or negative error code
> > > + *
> > > + * Generic get_scanout_buffer() implementation, for drivers that uses the
> > > + * drm_fb_dma_helper.
> > > + */
> > > +int drm_panic_gem_get_scanout_buffer(struct drm_device *dev,
> > > +				     struct drm_scanout_buffer *sb)
> > > +{
> > > +	struct drm_plane *plane;
> > > +	struct drm_gem_dma_object *dma_obj;
> > > +	struct drm_framebuffer *fb;
> > > +
> > > +	drm_for_each_primary_visible_plane(plane, dev) {
> > 
> > Ok that's not enough locking by far. You can't just hope that nothing
> > disappears while you're in a panic handler.
> > We've been there and ended up reliably oopsing in the panic handler
> > itself. So you _have_ to follow the full set of locking rules for all
> > drm structures, or things will just get worse at the worst possible
> > moment.
> > 
> > But also, you're not allowed to do anything else than trylock, because
> > a panic handler might run from nmi context, and so you cannot even
> > acquire irq-safe spinlocks reliably.
> > 
> > Which means:
> > 
> > - You need to be safe against concurrent drm_dev_unregister. Using the
> >   atomic panic notifier directly for each device should take care of
> >   that (but maybe that stuff is still not nmi safe, not sure).
> > 
> > - You _have_ to use all the locks. Luckily iterating over the plane
> >   list doesn't need one, but you have to trylock the plane's modeset
> >   lock. Which means your nice iterator macro is already toast, because
> >   that already looks at state it's not allowed to look at without a
> >   lock. Or well, the plane->state pointer is no-go already.
> 
> mutex_trylock() shouldn't be called from interrupt context, and as the
> panic may occur in irq, I can't use that.

Yeah, I learned that later that day too, disappointing :-/

What's worse, spin_trylock is also a no-go, because on RT kernels
spinlocks are essentially sleeping locks. So spin_trylock can retry in a
loop, and if we interrupt another thread at /just/ the right moment,
where the spinlock is in an inconsistent state, the spin_trylock in the
panic handler will loop forever.

> But the panic context should guarantee that only one CPU is still
> running:
> https://elixir.bootlin.com/linux/latest/source/kernel/panic.c#L310

Uh no, panics can nest, so you might panic while a panic is going on and
interrupt it in the middle. So just making sure no other code is running
isn't enough, unfortunately. Also the IPI stuff that should ensure the
other CPUs are stopped might not work; afaiui in nmi panic context you
cannot send around IPIs.
> So I think using mutex_is_locked() should be safe:
> https://elixir.bootlin.com/linux/latest/source/include/linux/mutex.h#L128
> 
> This will only check if the lock is not taken, but as it's not possible
> for another task to run at the same time, I think that should be good
> enough?
> 
> The drawback is that if we want to test without crashing the kernel,
> then we need to take the locks with trylock() (and it's safe this
> time), but the code path would be slightly different.

So I think this predates some of our irc chat, and I haven't typed up
what I designed together with John Ogness. I haven't done the patch yet,
but here's a rough sketch:

- We use a raw_spin_lock. That's the _only_ lock type where trylock is
  ok.

- We protect the drm_plane->state update with that lock. That's an
  extremely tiny section of code, so should be fine.

- For just that pointer update we could also use rcu, but using a
  spinlock has the advantage that drivers with peek/poke mmio support
  (for accessing unmapped vram or similar) could wrap these mmio
  accesses with that spinlock, and so would be able to use peek/poke in
  their panic handler. As soon as we need to protect against mmio we
  really need a lock.

- Because the lock also provides a barrier, there's a lot of things you
  can safely assume, and so don't need additional locking for in the
  panic handler:

  - Anything that's invariant over the lifetime of a drm_device (as
    delineated by drm_dev_register/unregister) is safe to access, like
    the plane list or the plane structures.

  - Anything in drm_plane_state that's only computed in atomic_check
    and not touched in atomic_commit code anymore is safe to access.
    This means we can get at the bo and everything else without any
    locks, and anything that is invariant for the lifetime of the bo,
    like the vaddr for a dma-helper bo, is also fine.

  - Furthermore, ->prepare_fb is called _before_ this point, and we
    stall at the right places before we call ->cleanup_fb.
    Unfortunately the same isn't true for ->begin/end_fb_access (by
    design), so we cannot rely on everything, but we can rely on the
    fact that the framebuffer is correctly pinned for drivers using
    dynamic buffer managers. The kernel vmap might not be around,
    though; I need to think more about how we best solve that issue.
    But that's a problem for when we add panic support to the first
    such driver.

I think with this design we'd cover 95% of all cases by simply doing
the raw_spin_trylock before we call into the driver's callback.

Cheers, Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch