Sometimes user space submits a bad command stream to the kernel. With the
current scheme, the GPU scheduler always resubmits all unsignaled jobs to
the hardware ring after a GPU reset, so the bad submission triggers a GPU
hang over and over again.

This patch series implements a mechanism called "guilty context", which
avoids resubmitting malicious jobs and invalidates the context behind
them. That way regular applications can continue to run, and other VFs
lose less GPU time.

The guilty rule is simple: if a job hangs more times than the threshold
allows, we consider it guilty, invalidate the context behind it, and pop
all jobs out of its entities on each scheduler.

The next IOCTL on this context will get an -ENODEV error, so the UMD knows
that the driver released the context because of its malicious command
submission.

Monk Liu (5):
  drm/amdgpu:keep ctx alive till all job finished
  drm/amdgpu:some modifications in amdgpu_ctx
  drm/amdgpu:Impl guilty ctx feature for sriov TDR
  drm/amdgpu:change sriov_gpu_reset interface
  drm/amdgpu:sriov TDR only recover hang ring

 drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 12 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 26 ++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 39 ++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 43 ++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     |  6 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 30 +++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h      |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h      |  2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c         |  2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c         |  2 +-
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 87 ++++++++++++++++++++++++---
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  3 +
 13 files changed, 209 insertions(+), 47 deletions(-)

-- 
2.7.4
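
For illustration only, the guilty rule described in this cover letter can be sketched in plain C as below. This is a minimal sketch under stated assumptions, not the code from the patches: struct ctx, struct job, HANG_LIMIT, job_should_resubmit() and ctx_submit() are hypothetical names, and the threshold value of 2 is made up. The real series works on the amdgpu context and scheduler entity structures and also pops the queued jobs of a guilty context, which this sketch leaves out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; not the amdgpu structures. */
#define HANG_LIMIT 2            /* assumed threshold, for illustration */

struct ctx {
	bool guilty;            /* set once a job of this context is blamed */
};

struct job {
	struct ctx *ctx;        /* context that submitted the job */
	int hang_count;         /* how many hangs were attributed to this job */
};

/*
 * Called after a GPU reset for the job that was blamed for the hang.
 * Returns true if the job may be resubmitted to the ring, false if it
 * exceeded the threshold and its context has been marked guilty.
 */
bool job_should_resubmit(struct job *job)
{
	assert(job && job->ctx);

	job->hang_count++;
	if (job->hang_count > HANG_LIMIT) {
		job->ctx->guilty = true;
		return false;
	}
	return true;
}

/*
 * Entry check for a later submission on the same context: a guilty
 * context is refused with -ENODEV so user space can tell that the
 * driver invalidated it.
 */
int ctx_submit(const struct ctx *ctx)
{
	return ctx->guilty ? -ENODEV : 0;
}
```

With the assumed limit of 2, the first two hangs of a job still allow a resubmit; the third one marks the context guilty, and every later ctx_submit() on it returns -ENODEV.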