Also, I would end up walking ring_mirror_list twice: once from amdgpu_device_post_asic_reset and again later from drm_sched_job_timedout. This might cause double fence processing. Isn't it more correct to disconnect from the HW fences only after the schedulers have been stopped, and to connect back before we restart the schedulers (as you pointed out here before)?

What I mean is: should we get rid of the dma_fence_add/remove_callback logic in drm_sched_job_timedout and instead do it in each driver, between scheduler deactivation and reactivation?

Andrey

On 11/22/2018 02:56 PM, Grodzovsky, Andrey wrote:
In addition to that, I would try to improve the pre, middle, post handling by checking whether we made some progress in between. In other words, we stop all schedulers in the pre handling and disconnect the scheduler fences from the hardware fences, like I did in patch "drm/sched: fix timeout handling v2". Then, before we do the actual reset in the middle handling, we check whether the offending job has completed or at least made some progress in the meantime.

I understand how to check whether the job completed (its fence has already signaled), but how do I test whether the job made 'at least some progress'?

Good question. Maybe we can somehow query from the hardware the number of primitives or pixels processed so far and then compare after a moment?

I will check on this later. In the meantime I will update the code with the proposed per-hive locking, and I will add a check for whether the guilty job completed before the ASIC reset, skipping the reset if it did.

Andrey
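
To make the ordering discussed above concrete, here is a rough, self-contained sketch in plain C. It is only a model of the proposal, not kernel code: every type and function name (sketch_dev, sketch_pre, sketch_recover, and so on) is hypothetical, and the "progress" counter stands in for whatever the hardware might expose, which is still an open question in this thread. Treating "made some progress" as a reason to skip the reset is also an assumption on my side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these are real amdgpu or drm_sched types. */
struct sketch_job {
	bool hw_fence_signaled; /* models "the job's HW fence already signaled" */
	uint64_t progress;      /* assumed HW counter, e.g. primitives processed */
};

struct sketch_sched {
	bool stopped;           /* scheduler is parked */
	bool hw_cb_connected;   /* scheduler fences are hooked to HW fences */
};

struct sketch_dev {
	struct sketch_sched *scheds;
	size_t num_scheds;
	unsigned int resets_done;
};

enum sketch_result {
	SKETCH_SKIPPED_COMPLETED, /* guilty job finished, no reset needed */
	SKETCH_SKIPPED_PROGRESS,  /* job still advancing, no reset needed */
	SKETCH_RESET_DONE,
};

/* Pre handling: stop every scheduler first, and only then disconnect. */
static void sketch_pre(struct sketch_dev *dev)
{
	for (size_t i = 0; i < dev->num_scheds; i++)
		dev->scheds[i].stopped = true;
	for (size_t i = 0; i < dev->num_scheds; i++)
		dev->scheds[i].hw_cb_connected = false;
}

/* Post handling: reconnect every scheduler before restarting any of them. */
static void sketch_post(struct sketch_dev *dev)
{
	for (size_t i = 0; i < dev->num_scheds; i++)
		dev->scheds[i].hw_cb_connected = true;
	for (size_t i = 0; i < dev->num_scheds; i++)
		dev->scheds[i].stopped = false;
}

/*
 * progress_before is a snapshot of the (assumed) HW counter taken when the
 * timeout fired; it is compared against the job's current counter value.
 */
static enum sketch_result sketch_recover(struct sketch_dev *dev,
					 const struct sketch_job *guilty,
					 uint64_t progress_before)
{
	enum sketch_result res;

	assert(dev != NULL && guilty != NULL);

	sketch_pre(dev);

	/* Middle handling: decide whether the reset is still needed. */
	if (guilty->hw_fence_signaled) {
		res = SKETCH_SKIPPED_COMPLETED;
	} else if (guilty->progress > progress_before) {
		res = SKETCH_SKIPPED_PROGRESS;
	} else {
		dev->resets_done++; /* stands in for the actual ASIC reset */
		res = SKETCH_RESET_DONE;
	}

	sketch_post(dev);
	return res;
}
```

With this shape each scheduler is visited exactly once on the way down and once on the way up, so the per-job timeout handler would not need its own add/remove callback pass.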
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx