>>> I would like to check why
we need a special sequences for sriov on this pre_reset. If
possible, make it the same as bare metal mode sounds better
solution.
Because before VF FLR calling function
would lead to register access through KIQ, which will not
complete because KIQ/GFX already hang by that time
>>> I don't remember any
register access by amdkfd_pre_reset call, please let me
know if this assumption is wrong .
Please check “void pm_uninit(struct
packet_manager *pm)” which is invoked inside of
amdkfd_pre_reset() :
It will call uninitialized() in
kfd_kernel_queue.c file
And then go to the path of
“kq->mqd_mgr->destroy_mqd(…)”
And finally it calls “static int
kgd_hqd_destroy(…)” in amdgpu_amdkfd_gfx_v10.c
539 {
540 struct amdgpu_device *adev =
get_amdgpu_device(kgd);
541 enum hqd_dequeue_request_type
type;
542 unsigned long end_jiffies;
543 uint32_t temp;
544 struct v10_compute_mqd *m =
get_mqd(mqd);
545
546 #if 0
547 unsigned long flags;
548 int retry;
549 #endif
550
551 acquire_queue(kgd, pipe_id,
queue_id);
//this introduce register access via KIQ
552
553 if (m->cp_hqd_vmid == 0)
554 WREG32_FIELD15(GC, 0,
RLC_CP_SCHEDULERS, scheduler1, 0);
//this introduce register
access via KIQ
555
556 switch (reset_type) {
557 case
KFD_PREEMPT_TYPE_WAVEFRONT_DRAIN:
558 type = DRAIN_PIPE;
559 break;
560 case
KFD_PREEMPT_TYPE_WAVEFRONT_RESET:
561 type = RESET_WAVES;
562 break;
563 default:
564 type = DRAIN_PIPE;
565 break;
566 }
624 WREG32(SOC15_REG_OFFSET(GC, 0,
mmCP_HQD_DEQUEUE_REQUEST), type);
//this introduce register
access via KIQ
625
626 end_jiffies = (utimeout * HZ /
1000) + jiffies;
627 while (true) {
628 temp =
RREG32(SOC15_REG_OFFSET(GC, 0, mmCP_HQD_ACTIVE));
//this introduce register
access via KIQ
629 if (!(temp &
CP_HQD_ACTIVE__ACTIVE_MASK))
630 break;
631 if (time_after(jiffies,
end_jiffies)) {
632 pr_err("cp queue
preemption time out.\n");
633 release_queue(kgd);
634 return -ETIME;
635 }
636 usleep_range(500, 1000);
637 }
638
639 release_queue(kgd);
640 return 0;
If we use the sequence from bare-metal,
all above
highlighted register access will not work because
KIQ/GFX already died by that time which means the
amdkfd_pre_reset() is actually not working as expected.
_____________________________________
Monk Liu|GPU
Virtualization Team |AMD
I don't remember any register access by
amdkfd_pre_reset call, please let me know if this
assumption is wrong .
This function will use hiq to access CP, in case CP already
hang, we might not able to get the response from hw and will
got a timeout. I think kfd internal should handle this.
Felix already have some comments on that.
I would like to check why we need a special sequences for
sriov on this pre_reset. If possible, make it the same as
bare metal mode sounds better solution.
Regards
Shaoyun.liu
Oh, by the way
>>> Do we know the root cause why this function
would ruin MEC ?
Only we call this function right after VF FLR will ruin
MEC and lead to following KIQ ring test fail , and on
bare-metal it is called before gpu rest , so that's why
on bare-metal we don't have this issue
But the reason we cannot call it before VF FLR on SRIOV
case was already stated in this thread
Thanks
_____________________________________
Monk Liu|GPU Virtualization Team |AMD
-----Original Message-----
From: Liu, Monk
Sent: Thursday, December 19, 2019 11:49 AM
To: shaoyunl <shaoyun.liu@xxxxxxx>;
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: RE: [PATCH 2/2] drm/amdgpu: fix KIQ ring test
fail in TDR of SRIOV
Hi Shaoyun
>>> Do we know the root cause why this function
would ruin MEC ? From the logic, I think this function
should be called before FLR since we need to disable the
user queue submission first.
Right now I don't know which detail step lead to KIQ
ring test fail, I totally agree with you that this func
should be called before VF FLR, but we cannot do it and
the reason is described in The comment:
> if we do pre_reset() before VF FLR, it would go KIQ
way to do register
> access and stuck there, because KIQ probably won't
work by that time
> (e.g. you already made GFX hang)
>>> I remembered the function should use hiq to
communicate with HW , shouldn't use kiq to access HW
registerm, has this been changed ?
Tis function use WREG32/RREG32 to do register access,
like all other functions in KMD, and WREG32/RREG32 will
let KIQ to do the register access If we are under
dynamic SRIOV mode (means we are SRIOV VF and isn't
under full exclusive mode)
You see that if you call this func before EVENT_5 (event
5 triggers VF FLR) then it will run under dynamic mode
and KIQ will handle the register access, which is not an
option Since ME/MEC probably already hang ( if we are
testing quark on gfx/compute rings)
Do you have a good suggestion ?
thanks
_____________________________________
Monk Liu|GPU Virtualization Team |AMD
-----Original Message-----
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx>
On Behalf Of shaoyunl
Sent: Tuesday, December 17, 2019 11:38 PM
To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH 2/2] drm/amdgpu: fix KIQ ring test
fail in TDR of SRIOV
I think amdkfd side depends on this call to stop the
user queue, without this call, the user queue can submit
to HW during the reset which could cause hang again ...
Do we know the root cause why this function would ruin
MEC ? From the logic, I think this function should be
called before FLR since we need to disable the user
queue submission first.
I remembered the function should use hiq to communicate
with HW , shouldn't use kiq to access HW registerm, has
this been changed ?
Regards
shaoyun.liu
On 2019-12-17 5:19 a.m., Monk Liu wrote:
> issues:
> MEC is ruined by the amdkfd_pre_reset after VF FLR
done
>
> fix:
> amdkfd_pre_reset() would ruin MEC after hypervisor
finished the VF
> FLR, the correct sequence is do amdkfd_pre_reset
before VF FLR but
> there is a limitation to block this sequence:
> if we do pre_reset() before VF FLR, it would go KIQ
way to do register
> access and stuck there, because KIQ probably won't
work by that time
> (e.g. you already made GFX hang)
>
> so the best way right now is to simply remove it.
>
> Signed-off-by: Monk Liu <Monk.Liu@xxxxxxx>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git
a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 605cef6..ae962b9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3672,8 +3672,6 @@ static int
amdgpu_device_reset_sriov(struct amdgpu_device *adev,
> if (r)
> return r;
>
> - amdgpu_amdkfd_pre_reset(adev);
> -
> /* Resume IP prior to SMC */
> r =
amdgpu_device_ip_reinit_early_sriov(adev);
> if (r)
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://nam11.safelinks.protection.outlook.com/?url="">