Re: [PATCH 1/3] drm/panthor: Fix tiler OOM handling to allow incremental rendering

Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> · Thu, 25 Apr 2024 11:45:23 +0200

On Thu, 25 Apr 2024 10:28:49 +0100
Steven Price <steven.price@xxxxxxx> wrote:

> On 25/04/2024 08:18, Boris Brezillon wrote:
> > From: Antonino Maniscalco <antonino.maniscalco@xxxxxxxxxxxxx>
> > 
> > If the kernel couldn't allocate memory because we reached the maximum
> > number of chunks but no render passes are in flight
> > (panthor_heap_grow() returning -ENOMEM), we should defer the OOM
> > handling to the FW by returning a NULL chunk. The FW will then call
> > the tiler OOM exception handler, which is supposed to implement
> > incremental rendering (execute an intermediate fragment job to flush
> > the pending primitives, release the tiler memory that was used to
> > store those primitives, and start over from where it stopped).
> > 
> > Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
> > Signed-off-by: Antonino Maniscalco <antonino.maniscalco@xxxxxxxxxxxxx>
> > Signed-off-by: Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx>  
> 
> Reviewed-by: Steven Price <steven.price@xxxxxxx>
> 
> Although I think the real issue here is that we haven't clearly defined
> the return values from panthor_heap_grow - it's a bit weird to have two
> different error codes for the same "try again later after incremental
> rendering" result. But as a fix this seems most clear.

Yeah, I actually considered returning -EBUSY for the 'max_chunks
reached' situation, but then realized we would also want to trigger
incremental rendering for actual mem allocation failures (when
chunk_count < max_chunks) once the fail-able/non-blocking allocation
logic is implemented, and for this kind of failure it makes more sense
to return -ENOMEM, even though this implies checking against two values
instead of one.

I guess returning -ENOMEM instead of -EBUSY for the case where we have
render passes in-flight wouldn't be too awkward, as this can be seen as
the kernel refusing to allocate more memory.

> 
> Steve
> 
> > ---
> >  drivers/gpu/drm/panthor/panthor_sched.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> > index b3a51a6de523..6de8c0c702cb 100644
> > --- a/drivers/gpu/drm/panthor/panthor_sched.c
> > +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> > @@ -1354,7 +1354,13 @@ static int group_process_tiler_oom(struct panthor_group *group, u32 cs_id)
> >  					pending_frag_count, &new_chunk_va);
> >  	}
> >  
> > -	if (ret && ret != -EBUSY) {
> > +	/* If the kernel couldn't allocate memory because we reached the maximum
> > +	 * number of chunks (EBUSY if we have render passes in flight, ENOMEM
> > +	 * otherwise), we want to let the FW try to reclaim memory by waiting
> > +	 * for fragment jobs to land or by executing the tiler OOM exception
> > +	 * handler, which is supposed to implement incremental rendering.
> > +	 */
> > +	if (ret && ret != -EBUSY && ret != -ENOMEM) {
> >  		drm_warn(&ptdev->base, "Failed to extend the tiler heap\n");
> >  		group->fatal_queues |= BIT(cs_id);
> >  		sched_queue_delayed_work(sched, tick, 0);  
>