On Thu, Mar 17, 2022 at 05:44:57PM +0100, Christian König wrote: > Am 17.03.22 um 17:18 schrieb Rob Clark: > > On Thu, Mar 17, 2022 at 9:04 AM Christian König > > <christian.koenig@xxxxxxx> wrote: > > > Am 17.03.22 um 16:10 schrieb Rob Clark: > > > > [SNIP] > > > > userspace frozen != kthread frozen .. that is what this patch is > > > > trying to address, so we aren't racing between shutting down the hw > > > > and the scheduler shoveling more jobs at us. > > > Well exactly that's the problem. The scheduler is supposed to shoveling > > > more jobs at us until it is empty. > > > > > > Thinking more about it we will then keep some dma_fence instance > > > unsignaled and that is and extremely bad idea since it can lead to > > > deadlocks during suspend. > > Hmm, perhaps that is true if you need to migrate things out of vram? > > It is at least not a problem when vram is not involved. > > No, it's much wider than that. > > See what can happen is that the memory management shrinkers want to wait for > a dma_fence during suspend. > > And if you stop the scheduler they will just wait forever. > > What you need to do instead is to drain the scheduler, e.g. call > drm_sched_entity_flush() with a proper timeout for each entity you have > created. Yeah I think properly flushing the scheduler and stopping it and cutting all drivers over to that sounds like the right approach. Generally suspend shouldn't be such a critical path that this will hurt us, all the other io queues get flushed too afaik. Resume is the thing that needs to go real fast. So a patch set to move all drivers that open code the kthread_park to the right scheduler function sounds like the right idea here to me. -Daniel > > Regards, > Christian. > > > > > > So this patch here is an absolute clear NAK from my side. If amdgpu is > > > doing something similar that is a severe bug and needs to be addressed > > > somehow. > > I think amdgpu's use of kthread_park is not related to suspend, but > > didn't look too closely. > > > > And perhaps the solution for this problem is more complex in the case > > of amdgpu, I'm not super familiar with the constraints there. But I > > think it is a fine solution for integrated GPUs. > > > > BR, > > -R > > > > > Regards, > > > Christian. > > > > > > > BR, > > > > -R > > > > > -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch