I was able to bisect it to this commit:

$git bisect good
6643ba1ff05d252e451bada9443759edb95eab3b is the first bad commit
commit 6643ba1ff05d252e451bada9443759edb95eab3b
Author: Luben Tuikov <luben.tuikov@xxxxxxx>
Date:   Mon Feb 10 18:16:45 2020 -0500

    drm/amdgpu: Move to a per-IB secure flag (TMZ)

    Move from a per-CS secure flag (TMZ) to a per-IB secure flag.

    Signed-off-by: Luben Tuikov <luben.tuikov@xxxxxxx>
    Reviewed-by: Huang Rui <ray.huang@xxxxxxx>

 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c   |  2 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c   | 23 ++++++++++++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.h  |  3 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  9 ++++-----
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c   | 23 +++++++----------------
 drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c    |  3 +--
 drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c    |  3 +--
 drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c    |  3 +--
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c    | 20 ++++++--------------
 include/uapi/drm/amdgpu_drm.h            |  7 ++++---
 10 files changed, 44 insertions(+), 52 deletions(-)

It's a bit baffling and perhaps there is a clash in the new flag,
or libdrm needs to also be updated. Will look at it more tomorrow.

My bisect log can be found below.

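For anyone who wants to repeat the bisect without judging each step by
hand, here is a minimal sketch of a helper that could back a script given
to "git bisect run". It is an illustration only, not what produced the
log below. The function name classify_log is hypothetical, and the sketch
assumes the kernel log of the candidate build has already been captured
(for example with dmesg) after running "amdgpu_test -s 1 -t 4". All it
does is map that log to the exit codes "git bisect run" expects: 0 for
good, 1 for bad, 125 for "cannot test, skip".

```shell
#!/bin/sh
# Hypothetical helper: classify one captured kernel log for `git bisect run`.
# Reads the log on stdin.
#   returns 125 if no log was captured (bisect should skip this commit)
#   returns 1   if the gfx ring timeout from this thread is present (bad)
#   returns 0   otherwise (good)
classify_log() {
    log=$(cat)

    # An empty log means the candidate could not be tested at all.
    if [ -z "$log" ]; then
        return 125
    fi

    # Match the job-timeout line, e.g.
    # "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, ..."
    if printf '%s\n' "$log" | grep -q 'amdgpu_job_timedout.*ring .* timeout'; then
        return 1
    fi

    return 0
}
```

A wrapper would still have to build and boot each candidate kernel on
the test machine before calling this, which is not shown here; with that
in place, "dmesg | classify_log" gives the verdict for the current step.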
Regards,
Luben

------------
git bisect start
# good: [31866a9d7d40245316ad7c17b87961f68321cab8] drm/amd/display: Move drm_dp_mst_atomic_check() to the front of dc_validate_global_state()
git bisect good 31866a9d7d40245316ad7c17b87961f68321cab8
# bad: [7fd3b632e17e55c5ffd008f9f025754e7daa1b66] drm/amdgpu: fix colliding of preemption
git bisect bad 7fd3b632e17e55c5ffd008f9f025754e7daa1b66
# good: [41d073f29e59abdfb0d415033772c01c321086c9] drm/amdgpu/vcn2.5: fix warning
git bisect good 41d073f29e59abdfb0d415033772c01c321086c9
# good: [71da21488b65ade2b789416088b9f2493ad3e056] drm/amd/display: fix dtm unloading
git bisect good 71da21488b65ade2b789416088b9f2493ad3e056
# bad: [e3ca25cd2e75824e4dd9e6bb16013ab5f3ec63a6] drm/ttm: individualize resv objects before calling release_notify
git bisect bad e3ca25cd2e75824e4dd9e6bb16013ab5f3ec63a6
# good: [7e3452a6536ee7136a4d79f2369f15d5ce96583c] drm/amdgpu: return -EFAULT if copy_to_user() fails
git bisect good 7e3452a6536ee7136a4d79f2369f15d5ce96583c
# bad: [9b7ac0fb3bbfd6dd001423da497aafec3e8a5131] drm/amdgpu: log on non-zero error conter per IP before GPU reset
git bisect bad 9b7ac0fb3bbfd6dd001423da497aafec3e8a5131
# bad: [6643ba1ff05d252e451bada9443759edb95eab3b] drm/amdgpu: Move to a per-IB secure flag (TMZ)
git bisect bad 6643ba1ff05d252e451bada9443759edb95eab3b
# good: [3387f56e37b2fa8b0fbb3a538bc08daae923bb5f] drm/amd/powerplay: correct the way for checking SMU_FEATURE_BACO_BIT support
git bisect good 3387f56e37b2fa8b0fbb3a538bc08daae923bb5f
# first bad commit: [6643ba1ff05d252e451bada9443759edb95eab3b] drm/amdgpu: Move to a per-IB secure flag (TMZ)
------------

On 2020-02-19 8:02 p.m., Luben Tuikov wrote:
> New developments:
>
> Running "amdgpu_test -s 1 -t 4" causes timeouts and koops. Attached
> is the system log, tested Navi 10:
>
> [ 144.484547] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
> [ 149.604641] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=1459, emitted seq=1462
> [ 149.604779] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process amdgpu_test pid 2696 thread amdgpu_test pid 2696
> [ 149.604788] amdgpu 0000:0b:00.0: GPU reset begin!
> ...
>
> The kernel is at 7fd3b632e17e55c5ffd008f9f025754e7daa1b66 plus
> the patch of the original post of this thread (thus the "-dirty").
>
> Running the same test on the previous version of the kernel I was running,
> at 31866a9d7d40245316ad7c17b87961f68321cab8, succeeds as follows:
>
> Suite: Basic Tests
>   Test: Command submission Test (GFX) ...passed
>
> Run Summary:    Type   Total     Ran  Passed  Failed  Inactive
>               suites      11       0     n/a       0         0
>                tests      63       1       1       0         0
>              asserts  526725  526725  526725       0       n/a
>
> Elapsed time = 0.027 seconds
>
> Regards,
> Luben
>
> On 2020-02-19 4:40 p.m., Luben Tuikov wrote:
>> On 2020-02-19 9:44 a.m., Christian König wrote:
>>> Well it should apply on top of amd-staging-drm-next. But I haven't
>>> fetched that today yet.
>>>
>>> Give me a minute to rebase.
>>
>> This patch seems to have fixed the regression we saw yesterday.
>> It applies to amd-staging-drm-next with a small jitter:
>>
>> $patch -p1 < /tmp/\[PATCH\]\ drm_amdgpu\:\ add\ VM\ update\ fences\ back\ to\ the\ root\ PD.eml
>> patching file amdgpu_vm.c
>> Hunk #2 succeeded at 1599 (offset -20 lines).
>>
>> I've been running 'glxgears' on the root window and 'pinion'
>> and no problems--clean log.
>>
>> Tested-by: Luben Tuikov <luben.tuikov@xxxxxxx>
>>
>> Regards,
>> Luben
>>
>>>
>>> Christian.
>>>
>>> On 19.02.20 at 15:27, Tom St Denis wrote:
>>>> This doesn't apply on top of 7fd3b632e17e55c5ffd008f9f025754e7daa1b66
>>>> which is the tip of drm-next
>>>>
>>>>
>>>> Tom
>>>>
>>>> On 2020-02-19 9:20 a.m., Christian König wrote:
>>>>> Add update fences to the root PD while mapping BOs.
>>>>>
>>>>> Otherwise PDs freed during the mapping won't wait for
>>>>> updates to finish and can cause corruptions.
>>>>>
>>>>> Signed-off-by: Christian König <christian.koenig@xxxxxxx>
>>>>> ---
>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 14 ++++++++++++--
>>>>>  1 file changed, 12 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> index e7ab0c1e2793..dd63ccdbad2a 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> @@ -585,8 +585,8 @@ void amdgpu_vm_get_pd_bo(struct amdgpu_vm *vm,
>>>>>  {
>>>>>      entry->priority = 0;
>>>>>      entry->tv.bo = &vm->root.base.bo->tbo;
>>>>> -    /* One for TTM and one for the CS job */
>>>>> -    entry->tv.num_shared = 2;
>>>>> +    /* Two for VM updates, one for TTM and one for the CS job */
>>>>> +    entry->tv.num_shared = 4;
>>>>>      entry->user_pages = NULL;
>>>>>      list_add(&entry->tv.head, validated);
>>>>>  }
>>>>> @@ -1619,6 +1619,16 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
>>>>>          goto error_unlock;
>>>>>      }
>>>>>
>>>>> +    if (flags & AMDGPU_PTE_VALID) {
>>>>> +        struct amdgpu_bo *root = vm->root.base.bo;
>>>>> +
>>>>> +        if (!dma_fence_is_signaled(vm->last_direct))
>>>>> +            amdgpu_bo_fence(root, vm->last_direct, true);
>>>>> +
>>>>> +        if (!dma_fence_is_signaled(vm->last_delayed))
>>>>> +            amdgpu_bo_fence(root, vm->last_delayed, true);
>>>>> +    }
>>>>> +
>>>>>      r = vm->update_funcs->prepare(&params, resv, sync_mode);
>>>>>      if (r)
>>>>>          goto error_unlock;
>>>
>>
>