On Fri, 8 Dec 2023 16:27:39 +0000
Steven Price <steven.price@xxxxxxx> wrote:

> On 04/12/2023 17:33, Boris Brezillon wrote:
> > Tiler heap growing requires some kernel driver involvement: when the
> > tiler runs out of heap memory, it will raise an exception which is
> > either directly handled by the firmware if some free heap chunks are
> > available in the heap context, or passed back to the kernel otherwise.
> > The heap helpers will be used by the scheduler logic to allocate more
> > heap chunks to a heap context, when such a situation happens.
> >
> > Heap context creation is explicitly requested by userspace (using
> > the TILER_HEAP_CREATE ioctl), and the returned context is attached to a
> > queue through some command stream instruction.
> >
> > All the kernel does is keep the list of heap chunks allocated to a
> > context, so they can be freed when TILER_HEAP_DESTROY is called, or
> > extended when the FW requests a new chunk.
> >
> > v3:
> > - Add a FIXME for the heap OOM deadlock
>
> Sadly I don't have any better solutions that what you've described in
> the FIXME.

Fully fixing the problem indeed requires having non-blocking/failable
BO allocation helpers (which is something we have on our TODO), but
there's something we might want to address now: the heap chunk
allocation currently happens with the scheduler lock held. That
prevents the job timeout handler from killing the group, and leads to
an actual deadlock of the whole scheduler. I think we should defer the
heap chunk allocation to a work item, queued to a different wq (that's
what kbase does BTW).
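
To make the deferral idea concrete, here is a rough userspace model of
it. This is only a sketch under stated assumptions: a pthread worker
stands in for the separate workqueue, malloc() stands in for the BO
allocation, and every name (struct sched, heap_grow_worker, ...) is
made up for illustration. None of it is driver code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CHUNKS 8
#define CHUNK_SIZE 4096

/* Hypothetical stand-in for the scheduler state. */
struct sched {
	pthread_mutex_t lock;	/* models the scheduler lock */
	pthread_cond_t cond;
	bool grow_pending;	/* set when the FW reports a heap OOM */
	bool group_killed;	/* set by the timeout path */
	bool stop;
	unsigned int handled;	/* number of grow requests processed */
	size_t chunk_count;
	void *chunks[MAX_CHUNKS];
};

/* Models the work item: the allocation happens with the lock dropped. */
static void *heap_grow_worker(void *arg)
{
	struct sched *s = arg;

	pthread_mutex_lock(&s->lock);
	while (!s->stop) {
		if (!s->grow_pending) {
			pthread_cond_wait(&s->cond, &s->lock);
			continue;
		}
		s->grow_pending = false;

		pthread_mutex_unlock(&s->lock);
		void *chunk = malloc(CHUNK_SIZE);	/* may block, lock not held */
		pthread_mutex_lock(&s->lock);

		/* Re-check state: the group may have been killed meanwhile. */
		if (chunk && !s->group_killed && s->chunk_count < MAX_CHUNKS)
			s->chunks[s->chunk_count++] = chunk;
		else
			free(chunk);
		s->handled++;
		pthread_cond_broadcast(&s->cond);
	}
	pthread_mutex_unlock(&s->lock);
	return NULL;
}

/* Scheduler side: only flags the request under the lock, never allocates. */
static void sched_request_heap_grow(struct sched *s)
{
	pthread_mutex_lock(&s->lock);
	s->grow_pending = true;
	pthread_cond_broadcast(&s->cond);
	pthread_mutex_unlock(&s->lock);
}

/* Timeout side: can always take the lock, since no allocation holds it. */
static void sched_kill_group(struct sched *s)
{
	pthread_mutex_lock(&s->lock);
	s->group_killed = true;
	while (s->chunk_count > 0)
		free(s->chunks[--s->chunk_count]);
	pthread_mutex_unlock(&s->lock);
}

/* Wait until n requests were processed, return the current chunk count. */
static size_t wait_handled(struct sched *s, unsigned int n)
{
	pthread_mutex_lock(&s->lock);
	while (s->handled < n)
		pthread_cond_wait(&s->cond, &s->lock);
	size_t count = s->chunk_count;
	pthread_mutex_unlock(&s->lock);
	return count;
}

/* Returns 0 if a grow adds one chunk and a grow after the kill adds none. */
int run_demo(void)
{
	struct sched s;
	pthread_t worker;

	memset(&s, 0, sizeof(s));
	pthread_mutex_init(&s.lock, NULL);
	pthread_cond_init(&s.cond, NULL);
	if (pthread_create(&worker, NULL, heap_grow_worker, &s))
		return -1;

	sched_request_heap_grow(&s);
	size_t after_grow = wait_handled(&s, 1);

	sched_kill_group(&s);
	sched_request_heap_grow(&s);
	size_t after_kill = wait_handled(&s, 2);

	pthread_mutex_lock(&s.lock);
	s.stop = true;
	pthread_cond_broadcast(&s.cond);
	pthread_mutex_unlock(&s.lock);
	pthread_join(worker, NULL);
	return (after_grow == 1 && after_kill == 0) ? 0 : 1;
}
```

The point of the model is that the lock is only held to flag the
request and to publish the result, so the kill path never waits behind
a blocked allocation. The worker has to re-check the group state after
re-taking the lock, because the group may be gone by then.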