Don't have the code in front of me now, but as far as I remember it will
only prematurely terminate in drm_sched_cleanup_jobs if there is timeout
work in progress, which would not be the case if nothing hangs.

Andrey

________________________________________
From: Erico Nunes <nunes.erico@xxxxxxxxx>
Sent: 17 May 2019 17:42:48
To: Grodzovsky, Andrey
Cc: Deucher, Alexander; Koenig, Christian; Zhou, David(ChunMing); David Airlie; Daniel Vetter; Lucas Stach; Russell King; Christian Gmeiner; Qiang Yu; Rob Herring; Tomeu Vizoso; Eric Anholt; Rex Zhu; Huang, Ray; Deng, Emily; Nayan Deshmukh; Sharat Masetty; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx; lima@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: lima_bo memory leak after drm_sched job destruction rework

On Fri, May 17, 2019 at 10:43 PM Grodzovsky, Andrey
<Andrey.Grodzovsky@xxxxxxx> wrote:
> On 5/17/19 3:35 PM, Erico Nunes wrote:
> > Lima currently defaults to an "infinite" timeout. Setting a 500ms
> > default timeout like most other drm_sched users do fixed the leak for
> > me.
>
> I am not very clear about the problem - so you basically never allow a
> timeout handler to run? And then when the job hangs forever you get
> this memory leak? How did it work for you before this refactoring? As
> far as I remember, before this change sched->ops->free_job was called
> from drm_sched_job_finish, which is the work scheduled from the HW
> fence signaled callback - drm_sched_process_job - so if your job hangs
> forever anyway and that work is never scheduled, how was your free_job
> callback called?

In this particular case, the jobs run successfully; nothing hangs. Lima
currently specifies an "infinite" timeout to the drm scheduler, so if a
job did hang, it would hang forever, I suppose. But that is not the
issue here.

If I understand correctly, it worked well before the rework because the
cleanup was triggered at the end of drm_sched_process_job, independently
of the timeout. What I'm observing now is that even when jobs run
successfully, they are not cleaned up by the drm scheduler, because
drm_sched_cleanup_jobs seems to give up based on the status of a timeout
worker. I would expect the timeout value to only be relevant in
error/hung-job cases.

I will probably set the timeout to a reasonable value anyway; I just
posted here to report that this may be a bug in the drm scheduler after
that rework.
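
To illustrate what I mean, here is a rough sketch (the *_sketch function
names are made up for illustration, and the drm_sched_init() arguments
below are my assumptions about what lima would pass, not a copy of the
actual code):

#include <linux/jiffies.h>
#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

/*
 * What I believe drm_sched_cleanup_jobs is effectively doing after the
 * rework: it only proceeds when it can cancel a pending timeout work
 * item. With an "infinite" timeout that work is never queued, so
 * cancel_delayed_work() returns false and finished jobs are never
 * handed to sched->ops->free_job().
 */
static void drm_sched_cleanup_jobs_sketch(struct drm_gpu_scheduler *sched)
{
	/*
	 * "Don't destroy jobs while the timeout worker is running" - but
	 * if the driver never arms the timeout, there is nothing to
	 * cancel and this bails out every time, leaking finished jobs.
	 */
	if (!cancel_delayed_work(&sched->work_tdr))
		return;

	/* ... walk the list of finished jobs and free them ... */
}

/*
 * The workaround on the lima side: pass a finite timeout instead of
 * MAX_SCHEDULE_TIMEOUT when initializing the scheduler. 500 ms is what
 * most other drm_sched users pick; hw_submission = 1 and hang_limit = 0
 * are placeholder values here.
 */
int lima_sched_init_sketch(struct drm_gpu_scheduler *sched,
			   const struct drm_sched_backend_ops *ops,
			   const char *name)
{
	return drm_sched_init(sched, ops, 1, 0, msecs_to_jiffies(500), name);
}

With a finite timeout the delayed work actually gets queued while a job
is in flight, cancel_delayed_work() succeeds on the happy path, and the
leak goes away, which matches what I see with the 500ms default.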