Regression on gfx8 with ring init

Andrey.Grodzovsky@xxxxxxx (Andrey Grodzovsky) · Fri, 21 Sep 2018 13:11:45 -0400

Ping...

Andrey

On 09/20/2018 04:35 PM, Andrey Grodzovsky wrote:
>
> What's the status of this error and the suggested patch to fix it?
> It impacts GPU reset on Polaris11.
>
> Do we want to investigate why the original patch breaks it, or just
> disable the KIQ IB test with the proposed patch?
>
>
> P.S. Suspend/resume also stopped working on the latest branch - will bisect
> it later today or tomorrow.
>
>
> Andrey
>
>
> On 09/18/2018 11:00 AM, Christian König wrote:
>> Tom,
>>
>> can you check whether the following makes it work again?
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
>> index b6160de70d12..d65f5ba92fc5 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
>> @@ -937,6 +937,10 @@ static int gfx_v8_0_ring_test_ib(struct amdgpu_ring *ring, long timeout)
>>         return r;
>>  }
>>
>> +static int gfx_v8_0_kiq_ring_test_ib(struct amdgpu_ring *ring, long timeout)
>> +{
>> +       return 0;
>> +}
>>
>>  static void gfx_v8_0_free_microcode(struct amdgpu_device *adev)
>>  {
>> @@ -7174,7 +7178,7 @@ static const struct amdgpu_ring_funcs gfx_v8_0_ring_funcs_kiq = {
>>         .emit_ib = gfx_v8_0_ring_emit_ib_compute,
>>         .emit_fence = gfx_v8_0_ring_emit_fence_kiq,
>>         .test_ring = gfx_v8_0_ring_test_ring,
>> -       .test_ib = gfx_v8_0_ring_test_ib,
>> +       .test_ib = gfx_v8_0_kiq_ring_test_ib,
>>         .insert_nop = amdgpu_ring_insert_nop,
>>         .pad_ib = amdgpu_ring_generic_pad_ib,
>>         .emit_rreg = gfx_v8_0_ring_emit_rreg,
>>
>>
>> Thanks,
>> Christian.
>>
>> On 18.09.2018 at 16:41, Christian König wrote:
>>> CRTC and GFX interrupts seem to be working perfectly fine.
>>>
>>> The problem here looks like only EOP interrupts from the Compute 
>>> queue are not correctly handled.
>>>
>>> Most likely a bug somewhere in gfx_v8_0_eop_irq().
>>>
>>> Christian.
>>>
>>>> On 18.09.2018 at 16:36, Deucher, Alexander wrote:
>>>>
>>>> FWIW, a number of consumer Raven boards have bad IVRS tables
>>>> (Windows doesn't use interrupt remapping, so they are sometimes
>>>> wrong and probably not validated).  There are a number of workarounds
>>>> to manually override the IVRS tables to make interrupts work.  I
>>>> think specifying pci=noacpi is also a possible workaround.
>>>>
>>>>
>>>> Alex
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf 
>>>> of Christian König <christian.koenig at amd.com>
>>>> *Sent:* Tuesday, September 18, 2018 10:31:16 AM
>>>> *To:* StDenis, Tom; amd-gfx mailing list; Zhou, David(ChunMing)
>>>> *Subject:* Re: Regression on gfx8 with ring init
>>>> Well looks like interrupt processing is working perfectly fine.
>>>>
>>>> But looking at the error message once more I see that this actually
>>>> affects ring number 9 and not the GFX ring.
>>>>
>>>> Can you fix amdgpu_ib_ring_tests() to print ring->name instead of the
>>>> number?
>>>>
>>>> That must be some of the compute rings.
>>>>
>>>> Thanks,
>>>> Christian.
>>>>
>>>> On 18.09.2018 at 16:20, Tom St Denis wrote:
>>>> > On 2018-09-18 10:13 a.m., Christian König wrote:
>>>> >> Mhm, there is no failed IB test in there anymore, is there?
>>>> >
>>>> > Oh sorry, I thought you wanted to test HEAD~ ... Attached is a log from
>>>> > the tip of drm-next
>>>> >
>>>> > Tom
>>>> >
>>>> >>
>>>> >> Christian.
>>>> >>
>>>> >> On 18.09.2018 at 16:09, Tom St Denis wrote:
>>>> >>> Disabling IOMMU in the BIOS resulted in a correct boot up...
>>>> >>>
>>>> >>> Here's the log.
>>>> >>>
>>>> >>> Tom
>>>> >>>
>>>> >>> On 2018-09-18 9:58 a.m., Tom St Denis wrote:
>>>> >>>> Odd, I couldn't even boot my system with the dGPU as primary after
>>>> >>>> rebuilding the kernel.  It got hung up in the IOMMU driver (loads
>>>> >>>> of AMD-Vi IOMMU errors), which I wasn't able to capture because it
>>>> >>>> panicked before loading the network stack.
>>>> >>>>
>>>> >>>> Bizarre.
>>>> >>>>
>>>> >>>> I'll keep trying.
>>>> >>>>
>>>> >>>> Tom
>>>> >>>>
>>>> >>>>> On 2018-09-18 9:35 a.m., Christian König wrote:
>>>> >>>>> On 18.09.2018 at 15:32, Tom St Denis wrote:
>>>> >>>>>> On 2018-09-18 9:30 a.m., Christian König wrote:
>>>> >>>>>>> Great, not sure if that is good or bad news.
>>>> >>>>>>>
>>>> >>>>>>> Anyway, going to revert the change for now. Does anybody
>>>> >>>>>>> volunteer to figure out why interrupts sometimes don't work
>>>> >>>>>>> correctly on Raven?
>>>> >>>>>>
>>>> >>>>>> What does "doesn't work correctly" mean?  My workstation is a Raven1
>>>> >>>>>> (Ryzen 2400G) and, other than the TTM bulk move issue, has been
>>>> >>>>>> perfectly stable (through suspend/resumes too, I might add).
>>>> >>>>>>
>>>> >>>>>> Anything I could test with my devel raven?
>>>> >>>>>
>>>> >>>>> The problem seems to be that on some boards IH handling doesn't
>>>> >>>>> work as it should.
>>>> >>>>>
>>>> >>>>> Can you try to disable the onboard graphics and try again?
>>>> >>>>>
>>>> >>>>> If that still doesn't work there is a DRM_DEBUG in
>>>> >>>>> amdgpu_ih_process(), make that a DRM_ERROR and send me the
>>>> >>>>> resulting dmesg of loading amdgpu (but don't start any UMD).
>>>> >>>>>
>>>> >>>>> Thanks,
>>>> >>>>> Christian.
>>>> >>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> Tom
>>>> >>>>>>
>>>> >>>>>>>
>>>> >>>>>>> Christian.
>>>> >>>>>>>
>>>> >>>>>>> On 18.09.2018 at 15:27, Tom St Denis wrote:
>>>> >>>>>>>> This commit:
>>>> >>>>>>>>
>>>> >>>>>>>> [root at raven linux]# git bisect good
>>>> >>>>>>>> 9b0df0937a852d299fbe42a5939c9a8a4cc83c55 is the first bad commit
>>>> >>>>>>>> commit 9b0df0937a852d299fbe42a5939c9a8a4cc83c55
>>>> >>>>>>>> Author: Christian König <christian.koenig at amd.com>
>>>> >>>>>>>> Date:   Tue Sep 18 10:38:09 2018 +0200
>>>> >>>>>>>>
>>>> >>>>>>>>     drm/amdgpu: remove fence fallback
>>>> >>>>>>>>
>>>> >>>>>>>>     DC doesn't seem to have a fallback path either.
>>>> >>>>>>>>
>>>> >>>>>>>>     So when interrupts doesn't work any more we are pretty much busted no
>>>> >>>>>>>>     matter what.
>>>> >>>>>>>>
>>>> >>>>>>>>     Signed-off-by: Christian König <christian.koenig at amd.com>
>>>> >>>>>>>>     Reviewed-by: Chunming Zhou <david1.zhou at amd.com>
>>>> >>>>>>>>
>>>> >>>>>>>> Results in this:
>>>> >>>>>>>>
>>>> >>>>>>>> [   24.334025] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:07:00.0 on minor 1
>>>> >>>>>>>> [   24.335674] modprobe (3895) used greatest stack depth: 12600 bytes left
>>>> >>>>>>>> [   26.272358] [drm:gfx_v8_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out.
>>>> >>>>>>>> [   26.272460] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring 9 (-110).
>>>> >>>>>>>> [   26.407885] [drm:process_one_work] *ERROR* ib ring test failed (-110).
>>>> >>>>>>>> [   28.506708] fuse init (API version 7.27)
>>>> >>>>>>>>
>>>> >>>>>>>> On init with my polaris/raven1 system.
>>>> >>>>>>>>
>>>> >>>>>>>> Cheers,
>>>> >>>>>>>> Tom
>>>> >>>>>>>> _______________________________________________
>>>> >>>>>>>> amd-gfx mailing list
>>>> >>>>>>>> amd-gfx at lists.freedesktop.org
>>>> >>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>> >>>>>>>
>>>> >>>>>>
>>>> >>>>>
>>>> >>>>
>>>> >>>
>>>> >>
>>>> >
>>>>
>>>
>>
>>
>>
>
>
>
