[AMD Public Use]

Hi all,

The NAVI2X project has hit an issue that is really hard to solve, and it has turned out to be a general weakness of our TDR (timeout detection and recovery) mechanism. Consider the scenario below:
Ideally, though, the TDR mechanism would detect which ring is the guilty one, and the innocent ring would resubmit all of its pending jobs:
Here is a proposal intended to achieve the above goal. The rough procedure is:
1. Take the head job and submit it on this ring.
2. See whether it completes; if not, this job is the real "bad job".
3. If the head job is the "bad job", remove it from the mirror list.
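
The steps above can be sketched as a small simulation. This is only an illustration in Python, not driver code: `Job`, `Ring`, `run_alone`, and `find_guilty_jobs` are hypothetical names, and the `hangs` flag is an assumed stand-in for "the job did not complete before the timeout when submitted on its own".

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    # Assumption for the sketch: True means the job never completes,
    # even when it is the only job running on its ring.
    hangs: bool = False


@dataclass
class Ring:
    name: str
    # Pending jobs in submission order; the head is at index 0.
    mirror_list: deque = field(default_factory=deque)


def run_alone(job: Job) -> bool:
    """Hypothetical stand-in for: submit the job on its ring and wait,
    with a timeout, for it to complete. Returns True on completion."""
    return not job.hangs


def find_guilty_jobs(rings):
    """Serially re-run the head job of each ring and drop the bad ones.

    Returns a list of (ring name, job name) pairs for the guilty jobs.
    Jobs left in each mirror list are the ones that remain to be
    resubmitted.
    """
    guilty = []
    for ring in rings:
        if not ring.mirror_list:
            continue
        head = ring.mirror_list[0]          # step 1: take the head job
        if not run_alone(head):             # step 2: did it complete?
            ring.mirror_list.popleft()      # step 3: remove the bad job
            guilty.append((ring.name, head.name))
    return guilty


# Example: both rings reported a timeout, but only the GFX head job hangs.
gfx = Ring("gfx", deque([Job("g0", hangs=True), Job("g1")]))
kcq = Ring("kcq", deque([Job("c0"), Job("c1")]))
guilty = find_guilty_jobs([gfx, kcq])
```

In this example only `g0` ends up in `guilty`, and the KCQ ring keeps all of its pending jobs. The sketch leaves a head job that completed in the mirror list for simplicity; how a real driver would retire it is outside what the steps above describe.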
The idea is to re-run and re-check the head job of each ring serially, in order to single out the real black sheep and its guilty context.

P.S.: we can use this approach only when a GFX/KCQ ring reports a TDR, since those rings can affect each other. An SDMA ring timeout, by contrast, definitely proves that the head job on the SDMA ring is the guilty one.

Thanks
------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------
_______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx