Regression on gfx8 with ring init

Andrey.Grodzovsky@xxxxxxx (Andrey Grodzovsky) · Fri, 21 Sep 2018 14:04:28 -0400

BTW, this also seems to be what breaks suspend/resume.

Andrey

On 09/21/2018 01:56 PM, Andrey Grodzovsky wrote:
>
> No worries, I will just revert locally until then to clear the extra 
> errors during my investigation of current GPU reset status and issues.
>
>
> Andrey
>
>
> On 09/21/2018 01:53 PM, Christian König wrote:
>> I unfortunately don't have a Polaris to test this myself.
>>
>> But please give me time till Monday so that I can at least try one 
>> more thing to fix it.
>>
>> Christian.
>>
>> Am 21.09.2018 um 19:11 schrieb Andrey Grodzovsky:
>>>
>>> Ping...
>>>
>>>
>>> Andrey
>>>
>>>
>>> On 09/20/2018 04:35 PM, Andrey Grodzovsky wrote:
>>>>
>>>> What's the status of this error and the suggested patch to fix it?
>>>> It impacts GPU reset on Polaris11.
>>>>
>>>> Do we want to investigate why the original patch breaks it, or just
>>>> disable the test with the proposed patch?
>>>>
>>>>
>>>> P.S. Suspend/resume also stopped working on the latest branch - will
>>>> bisect it later today or tomorrow.
>>>>
>>>>
>>>> Andrey
>>>>
>>>>
>>>> On 09/18/2018 11:00 AM, Christian König wrote:
>>>>> Tom,
>>>>>
>>>>> can you check whether the following makes it work again?
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
>>>>> index b6160de70d12..d65f5ba92fc5 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
>>>>> @@ -937,6 +937,10 @@ static int gfx_v8_0_ring_test_ib(struct amdgpu_ring *ring, long timeout)
>>>>>         return r;
>>>>>  }
>>>>>
>>>>> +static int gfx_v8_0_kiq_ring_test_ib(struct amdgpu_ring *ring, long timeout)
>>>>> +{
>>>>> +       return 0;
>>>>> +}
>>>>>
>>>>>  static void gfx_v8_0_free_microcode(struct amdgpu_device *adev)
>>>>>  {
>>>>> @@ -7174,7 +7178,7 @@ static const struct amdgpu_ring_funcs gfx_v8_0_ring_funcs_kiq = {
>>>>>         .emit_ib = gfx_v8_0_ring_emit_ib_compute,
>>>>>         .emit_fence = gfx_v8_0_ring_emit_fence_kiq,
>>>>>         .test_ring = gfx_v8_0_ring_test_ring,
>>>>> -       .test_ib = gfx_v8_0_ring_test_ib,
>>>>> +       .test_ib = gfx_v8_0_kiq_ring_test_ib,
>>>>>         .insert_nop = amdgpu_ring_insert_nop,
>>>>>         .pad_ib = amdgpu_ring_generic_pad_ib,
>>>>>         .emit_rreg = gfx_v8_0_ring_emit_rreg,
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Christian.
>>>>>
>>>>> On 18.09.2018 at 16:41, Christian König wrote:
>>>>>> CRTC and GFX interrupts seem to be working perfectly fine.
>>>>>>
>>>>>> The problem here looks like only the EOP interrupts from the compute 
>>>>>> queue are not being handled correctly.
>>>>>>
>>>>>> Most likely a bug somewhere in gfx_v8_0_eop_irq().
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>> Am 18.09.2018 um 16:36 schrieb Deucher, Alexander:
>>>>>>>
>>>>>>> FWIW, a number of consumer Raven boards have bad IVRS tables
>>>>>>> (Windows doesn't use interrupt remapping, so they are sometimes
>>>>>>> wrong and probably not validated).  There are a number of
>>>>>>> workarounds to manually override the IVRS tables to make
>>>>>>> interrupts work.  I think specifying pci=noacpi is also a
>>>>>>> possible workaround.
>>>>>>>
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>> ------------------------------------------------------------------------
>>>>>>> *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on 
>>>>>>> behalf of Christian König <christian.koenig at amd.com>
>>>>>>> *Sent:* Tuesday, September 18, 2018 10:31:16 AM
>>>>>>> *To:* StDenis, Tom; amd-gfx mailing list; Zhou, David(ChunMing)
>>>>>>> *Subject:* Re: Regression on gfx8 with ring init
>>>>>>> Well looks like interrupt processing is working perfectly fine.
>>>>>>>
>>>>>>> But looking at the error message once more I see that this actually
>>>>>>> affects ring number 9 and not the GFX ring.
>>>>>>>
>>>>>>> Can you fix amdgpu_ib_ring_tests() to print ring->name instead of the
>>>>>>> number?
>>>>>>>
>>>>>>> That must be one of the compute rings.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 18.09.2018 um 16:20 schrieb Tom St Denis:
>>>>>>> > On 2018-09-18 10:13 a.m., Christian König wrote:
>>>>>>> >> Mhm, there is no failed IB test in there any more, is there?
>>>>>>> >
>>>>>>> > oh sorry I thought you wanted to test HEAD~ ... Attached is a log
>>>>>>> > from the tip of drm-next
>>>>>>> >
>>>>>>> > Tom
>>>>>>> >
>>>>>>> >>
>>>>>>> >> Christian.
>>>>>>> >>
>>>>>>> >> Am 18.09.2018 um 16:09 schrieb Tom St Denis:
>>>>>>> >>> Disabling IOMMU in the BIOS resulted in a correct boot up...
>>>>>>> >>>
>>>>>>> >>> Here's the log.
>>>>>>> >>>
>>>>>>> >>> Tom
>>>>>>> >>>
>>>>>>> >>> On 2018-09-18 9:58 a.m., Tom St Denis wrote:
>>>>>>> >>>> Odd, I couldn't even boot my system with the dGPU as primary after
>>>>>>> >>>> rebuilding the kernel.  It got hung up in the IOMMU driver (loads
>>>>>>> >>>> of AMD-Vi IOMMU errors) which I wasn't able to capture because it
>>>>>>> >>>> panicked before loading the network stack.
>>>>>>> >>>>
>>>>>>> >>>> Bizarre.
>>>>>>> >>>>
>>>>>>> >>>> I'll keep trying.
>>>>>>> >>>>
>>>>>>> >>>> Tom
>>>>>>> >>>>
>>>>>>> >>>> On 2018-09-18 9:35 a.m., Christian König wrote:
>>>>>>> >>>>> Am 18.09.2018 um 15:32 schrieb Tom St Denis:
>>>>>>> >>>>>> On 2018-09-18 9:30 a.m., Christian König wrote:
>>>>>>> >>>>>>> Great, not sure if that is good or bad news.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Anyway going to revert the change for now. Does anybody
>>>>>>> >>>>>>> volunteer to figure out why interrupts sometimes don't work
>>>>>>> >>>>>>> correctly on Raven?
>>>>>>> >>>>>>
>>>>>>> >>>>>> What does "doesn't work correctly" mean?  My workstation is a Raven1
>>>>>>> >>>>>> (Ryzen 2400G) and other than the TTM bulk move issue has been
>>>>>>> >>>>>> perfectly stable (through suspend/resumes too I might add).
>>>>>>> >>>>>>
>>>>>>> >>>>>> Anything I could test with my devel raven?
>>>>>>> >>>>>
>>>>>>> >>>>> The problem seems to be that on some boards IH handling doesn't
>>>>>>> >>>>> work as it should.
>>>>>>> >>>>>
>>>>>>> >>>>> Can you try to disable the onboard graphics and try again?
>>>>>>> >>>>>
>>>>>>> >>>>> If that still doesn't work there is a DRM_DEBUG in
>>>>>>> >>>>> amdgpu_ih_process(), make that a DRM_ERROR and send me the
>>>>>>> >>>>> resulting dmesg of loading amdgpu (but don't start any UMD).
>>>>>>> >>>>>
>>>>>>> >>>>> Thanks,
>>>>>>> >>>>> Christian.
>>>>>>> >>>>>
>>>>>>> >>>>>>
>>>>>>> >>>>>>
>>>>>>> >>>>>> Tom
>>>>>>> >>>>>>
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Christian.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Am 18.09.2018 um 15:27 schrieb Tom St Denis:
>>>>>>> >>>>>>>> This commit:
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> [root at raven linux]# git bisect good
>>>>>>> >>>>>>>> 9b0df0937a852d299fbe42a5939c9a8a4cc83c55 is the first bad commit
>>>>>>> >>>>>>>> commit 9b0df0937a852d299fbe42a5939c9a8a4cc83c55
>>>>>>> >>>>>>>> Author: Christian König <christian.koenig at amd.com>
>>>>>>> >>>>>>>> Date:   Tue Sep 18 10:38:09 2018 +0200
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> drm/amdgpu: remove fence fallback
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>>     DC doesn't seem to have a fallback path either.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>>     So when interrupts doesn't work any more we are pretty much
>>>>>>> >>>>>>>>     busted no matter what.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>>>>>> >>>>>>>> Reviewed-by: Chunming Zhou <david1.zhou at amd.com>
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> Results in this:
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> [ 24.334025] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:07:00.0 on minor 1
>>>>>>> >>>>>>>> [ 24.335674] modprobe (3895) used greatest stack depth: 12600 bytes left
>>>>>>> >>>>>>>> [ 26.272358] [drm:gfx_v8_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out.
>>>>>>> >>>>>>>> [ 26.272460] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring 9 (-110).
>>>>>>> >>>>>>>> [ 26.407885] [drm:process_one_work] *ERROR* ib ring test failed (-110).
>>>>>>> >>>>>>>> [ 28.506708] fuse init (API version 7.27)
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> On init with my polaris/raven1 system.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> Cheers,
>>>>>>> >>>>>>>> Tom
>>>>>>> >>>>>>>> _______________________________________________
>>>>>>> >>>>>>>> amd-gfx mailing list
>>>>>>> >>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>> >>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
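
For anyone following the thread, here is a minimal userspace sketch (not kernel code) of the two ideas quoted above: reporting the failing ring by name instead of by index, and giving the KIQ ring a test_ib callback that simply returns 0. All mock_* types, functions and tables are assumptions invented for illustration and do not match the real amdgpu structures; only the .test_ib callback name and the -110 value come from the thread. On Linux, -110 is -ETIMEDOUT on most architectures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the driver's ring and callback table. */
struct mock_ring;

struct mock_ring_funcs {
	int (*test_ib)(struct mock_ring *ring, long timeout);
};

struct mock_ring {
	const char *name;                    /* human-readable ring name */
	const struct mock_ring_funcs *funcs; /* per-ring-type callbacks */
};

/* An IB test that completes normally. */
static int mock_test_ib_ok(struct mock_ring *ring, long timeout)
{
	(void)ring;
	(void)timeout;
	return 0;
}

/* Stands in for an IB test whose completion is never observed. */
static int mock_test_ib_timeout(struct mock_ring *ring, long timeout)
{
	(void)ring;
	(void)timeout;
	return -ETIMEDOUT;
}

/* Same shape as the stub in the quoted patch: report success, test nothing. */
static int mock_kiq_test_ib(struct mock_ring *ring, long timeout)
{
	(void)ring;
	(void)timeout;
	return 0;
}

const struct mock_ring_funcs mock_funcs_ok = { .test_ib = mock_test_ib_ok };
const struct mock_ring_funcs mock_funcs_timeout = { .test_ib = mock_test_ib_timeout };
const struct mock_ring_funcs mock_funcs_kiq = { .test_ib = mock_kiq_test_ib };

/*
 * Run every ring's IB test and log failures by ring name.
 * Returns 0 if all rings pass, otherwise the first error seen;
 * *first_failed (if non-NULL) receives the name of that ring.
 */
int mock_ib_ring_tests(struct mock_ring *rings, size_t num_rings,
		       long timeout, const char **first_failed)
{
	int ret = 0;
	size_t i;

	for (i = 0; i < num_rings; i++) {
		struct mock_ring *ring = &rings[i];
		int r;

		assert(ring->funcs && ring->funcs->test_ib);
		r = ring->funcs->test_ib(ring, timeout);
		if (!r)
			continue;

		fprintf(stderr, "failed testing IB on ring %s (%d)\n",
			ring->name, r);
		if (!ret) {
			ret = r;
			if (first_failed)
				*first_failed = ring->name;
		}
	}

	return ret;
}
```

Calling mock_ib_ring_tests() on an array with one ring per table reports only the ring wired to mock_funcs_timeout. The ring using mock_funcs_kiq passes without running any test, so a stub of this kind disables the check rather than explaining the failure.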
