force app kill patch

ckoenig.leichtzumerken@xxxxxxxxx (Christian König) · Wed, 18 Apr 2018 11:03:56 +0200

Am 18.04.2018 um 11:00 schrieb Liu, Monk:
>
> 1.Drm_sched_entity_fini(): it exit right after entity->job_queue 
> empty, [ but that time scheduler is not fast enough to deal with this 
> entity now ]
>
> That should never happen.
>
> The last job from the entity->job_queue is only removed after the 
> scheduler is done with the entity (at least that was the original 
> idea, not sure if that still works as expected).
>
> [ML] no thatâ??s not true and we already catch the kernel NULL pointer 
> issue with a entity->last_scheduled fence get double put , exactly 
> like the way I described in the scenario â?¦
>
> Pixel already fixed it by moving the put/get pair on 
> entity->last_scheduled prior to spsc_queue_pop() and the race issue is 
> therefore avoided
>

Yeah, already seen and reviewed that. That's a good catch, please make 
sure that this gets pushed to amd-staging-drm-next ASAP.

Christian.

> /Monk
>
> *From:*Christian KÃ¶nig [mailto:ckoenig.leichtzumerken at gmail.com]
> *Sent:* 2018å¹´4æ??18æ?¥16:36
> *To:* Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian 
> <Christian.Koenig at amd.com>; Deng, Emily <Emily.Deng at amd.com>
> *Cc:* amd-gfx at lists.freedesktop.org
> *Subject:* Re: force app kill patch
>
>     See that in â??sched_entity_finiâ??, we only call
>     dma_fence_put(entity->last_scheduledâ?? under the condition of â??If
>     (entity->fini_status)â??, so
>
>     This way there is memory leak for the case of â??entity->fini_stats ==0â??
>
> Good catch, we indeed should fix that.
>
>
>     1.Drm_sched_entity_fini(): it exit right after entity->job_queue
>     empty, [ but that time scheduler is not fast enough to deal with
>     this entity now ]
>
> That should never happen.
>
> The last job from the entity->job_queue is only removed after the 
> scheduler is done with the entity (at least that was the original 
> idea, not sure if that still works as expected).
>
> Regards,
> Christian.
>
> Am 18.04.2018 um 09:20 schrieb Liu, Monk:
>
>     *Correctio for the scenario *
>
>     After we move fence_put(entity->last_sched) out of the fini_status
>     check:
>
>     A potential race issue for the scenario:
>
>     1.Drm_sched_entity_fini(): it exit right after entity->job_queue
>     empty, [ but that time scheduler is not fast enough to deal with
>     this entity now ]
>
>     2.Drm_sched_entity_cleanup() : it call
>     dma_fence_put(entity->last_scheduled) Â [ but this time
>     entity->last_scheduledÂ  actually points to the fence prior to the
>     real last one ]
>
>     3.Scheduler_main() now dealing with this entity: it call
>     dma_fence_put(entity->last_scheduled)Â Â Â  [Â Â  Now this fence get
>     double put !!! Â ]
>
>     4.Scheduler_main() now call dma_fence_get() on the *real* last one !
>
>     So eventually the real last one fence triggers memory leak and
>     more critical the double put fence cause NULL pointer access
>
>     /Monk
>
>     *From:*Liu, Monk
>     *Sent:* 2018å¹´4æ??18æ?¥15:11
>     *To:* Koenig, Christian <Christian.Koenig at amd.com>
>     <mailto:Christian.Koenig at amd.com>; Deng, Emily
>     <Emily.Deng at amd.com> <mailto:Emily.Deng at amd.com>
>     *Cc:* amd-gfx at lists.freedesktop.org
>     <mailto:amd-gfx at lists.freedesktop.org>
>     *Subject:* force app kill patch
>
>     Hi Christian & Emily
>
>     I think the v4 fix for â??fix force app kill hangâ??is still not good
>     enough:
>
>     First:
>
>     See that in â??sched_entity_finiâ??, we only call
>     dma_fence_put(entity->last_scheduledâ??under the condition of â??If
>     (entity->fini_status)â??, so
>
>     This way there is memory leak for the case of â??entity->fini_stats ==0â??
>
>     Second:
>
>     If we move dma_fence_put(entity->last_scheduled) out of the
>     condition of â??if (entity->fini_status)â??, the memory leak issue can
>     be fixed
>
>     But there will be kernel NULL pointer access, I think the time you
>     call dma_fence_put(entity->last_scheduledâ??) may actually executed
>     **not**
>
>     On the last scheduled fence of this entity, because it is run
>     without â??thread_park/unparkâ??pair which to make sure scheduler not
>     dealing this entity
>
>     So with certain race issue, here is the scenario:
>
>     1.scheduler is doing the dma_fence_put() on the 1^st fence,
>
>     2.scheduler set entity->last_scheduled to 1^st fence
>
>     3.now sched_entity_fini() run, and it call dma_fence_put() on
>     entity->last_scheduled
>
>     4.now this 1^st fence is actually put double time and the real
>     last fence wonâ??t get put by expected
>
>     any idea?
>
>     /Monk
>
>
>
>
>     _______________________________________________
>
>     amd-gfx mailing list
>
>     amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
>
>     https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20180418/b89b9f33/attachment-0001.html>