Regression on gfx8 with ring init

christian.koenig@xxxxxxx (Christian König) · Tue, 18 Sep 2018 17:00:01 +0200

Tom,

can you check whether the following makes it work again?

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
index b6160de70d12..d65f5ba92fc5 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
@@ -937,6 +937,10 @@ static int gfx_v8_0_ring_test_ib(struct amdgpu_ring *ring, long timeout)
 	return r;
 }

+static int gfx_v8_0_kiq_ring_test_ib(struct amdgpu_ring *ring, long timeout)
+{
+	return 0;
+}

 static void gfx_v8_0_free_microcode(struct amdgpu_device *adev)
 {
@@ -7174,7 +7178,7 @@ static const struct amdgpu_ring_funcs gfx_v8_0_ring_funcs_kiq = {
 	.emit_ib = gfx_v8_0_ring_emit_ib_compute,
 	.emit_fence = gfx_v8_0_ring_emit_fence_kiq,
 	.test_ring = gfx_v8_0_ring_test_ring,
-	.test_ib = gfx_v8_0_ring_test_ib,
+	.test_ib = gfx_v8_0_kiq_ring_test_ib,
 	.insert_nop = amdgpu_ring_insert_nop,
 	.pad_ib = amdgpu_ring_generic_pad_ib,
 	.emit_rreg = gfx_v8_0_ring_emit_rreg,


Thanks,
Christian.
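
For readers following along, here is a minimal standalone sketch of what the patch above does, combined with the request further down the thread to report ring->name instead of the ring number. It is not driver code: every demo_* type, function, and ring name is a hypothetical stand-in, and the always-failing compute test only imitates the timeout seen in the log (on Linux, ETIMEDOUT is 110, which matches the -110 reported there).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, heavily simplified stand-ins for the amdgpu ring types. */
struct demo_ring;

struct demo_ring_funcs {
	int (*test_ib)(struct demo_ring *ring, long timeout);
};

struct demo_ring {
	const char *name;
	const struct demo_ring_funcs *funcs;
};

/* Imitates an IB test whose completion interrupt never arrives. */
static int demo_test_ib_timeout(struct demo_ring *ring, long timeout)
{
	(void)ring;
	(void)timeout;
	return -ETIMEDOUT;
}

/* Mirrors the stub in the patch: skip the IB test and report success. */
static int demo_kiq_test_ib(struct demo_ring *ring, long timeout)
{
	(void)ring;
	(void)timeout;
	return 0;
}

/* Per-ring-type function tables, analogous to the *_ring_funcs tables. */
static const struct demo_ring_funcs demo_compute_funcs = {
	.test_ib = demo_test_ib_timeout,
};

static const struct demo_ring_funcs demo_kiq_funcs = {
	.test_ib = demo_kiq_test_ib,
};

/*
 * Run the IB test of every ring through its function table and report
 * failures by ring name rather than by index. Returns the failure count.
 */
static int demo_ib_ring_tests(struct demo_ring *rings, size_t count,
			      long timeout)
{
	int failures = 0;

	for (size_t i = 0; i < count; i++) {
		int r;

		assert(rings[i].funcs && rings[i].funcs->test_ib);
		r = rings[i].funcs->test_ib(&rings[i], timeout);
		if (r) {
			fprintf(stderr, "failed testing IB on ring %s (%d)\n",
				rings[i].name, r);
			failures++;
		}
	}
	return failures;
}
```

Because the test is dispatched through the per-ring table, swapping a single .test_ib pointer changes the behaviour for that ring type only; all other rings keep running the real test.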

On 18.09.2018 at 16:41, Christian König wrote:
> CRTC and GFX interrupts seem to be working perfectly fine.
>
> The problem here looks like only the EOP interrupts from the compute queue
> are not handled correctly.
>
> Most likely a bug somewhere in gfx_v8_0_eop_irq().
>
> Christian.
>
> On 18.09.2018 at 16:36, Deucher, Alexander wrote:
>>
>> FWIW, a number of consumer Raven boards have bad IVRS tables (Windows
>> doesn't use interrupt remapping, so they are sometimes wrong and
>> probably not validated). There are a number of workarounds to manually
>> override the IVRS tables to make interrupts work. I think specifying
>> pci=noacpi is also a possible workaround.
>>
>>
>> Alex
>>
>> ------------------------------------------------------------------------
>> *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of
>> Christian König <christian.koenig at amd.com>
>> *Sent:* Tuesday, September 18, 2018 10:31:16 AM
>> *To:* StDenis, Tom; amd-gfx mailing list; Zhou, David(ChunMing)
>> *Subject:* Re: Regression on gfx8 with ring init
>> Well, it looks like interrupt processing is working perfectly fine.
>>
>> But looking at the error message once more I see that this actually
>> affects ring number 9 and not the GFX ring.
>>
>> Can you fix amdgpu_ib_ring_tests() to print ring->name instead of the
>> number?
>>
>> That must be one of the compute rings.
>>
>> Thanks,
>> Christian.
>>
>> On 18.09.2018 at 16:20, Tom St Denis wrote:
>> > On 2018-09-18 10:13 a.m., Christian König wrote:
>> >> Mhm, there is no failed IB test in there any more, is there?
>> >
>> > oh sorry I thought you wanted to test HEAD~ ... Attached is a log from
>> > the tip of drm-next
>> >
>> > Tom
>> >
>> >>
>> >> Christian.
>> >>
>> >> On 18.09.2018 at 16:09, Tom St Denis wrote:
>> >>> Disabling IOMMU in the BIOS resulted in a correct boot up...
>> >>>
>> >>> Here's the log.
>> >>>
>> >>> Tom
>> >>>
>> >>> On 2018-09-18 9:58 a.m., Tom St Denis wrote:
>> >>>> Odd, I couldn't even boot my system with the dGPU as primary after
>> >>>> rebuilding the kernel. It got hung up in the IOMMU driver (loads
>> >>>> of AMD-Vi IOMMU errors), which I wasn't able to capture because it
>> >>>> panicked before loading the network stack.
>> >>>>
>> >>>> Bizarre.
>> >>>>
>> >>>> I'll keep trying.
>> >>>>
>> >>>> Tom
>> >>>>
>> >>>> On 2018-09-18 9:35 a.m., Christian König wrote:
>> >>>>> On 18.09.2018 at 15:32, Tom St Denis wrote:
>> >>>>>> On 2018-09-18 9:30 a.m., Christian König wrote:
>> >>>>>>> Great, not sure if that is good or bad news.
>> >>>>>>>
>> >>>>>>> Anyway, going to revert the change for now. Does anybody
>> >>>>>>> volunteer to figure out why interrupts sometimes don't work
>> >>>>>>> correctly on Raven?
>> >>>>>>
>> >>>>>> What does "doesn't work correctly" mean? My workstation is a Raven1
>> >>>>>> (Ryzen 2400G) and other than the TTM bulk move issue has been
>> >>>>>> perfectly stable (through suspend/resumes too I might add).
>> >>>>>>
>> >>>>>> Anything I could test with my devel raven?
>> >>>>>
>> >>>>> The problem seems to be that on some boards IH handling doesn't
>> >>>>> work as it should.
>> >>>>>
>> >>>>> Can you try to disable the onboard graphics and try again?
>> >>>>>
>> >>>>> If that still doesn't work there is a DRM_DEBUG in
>> >>>>> amdgpu_ih_process(), make that a DRM_ERROR and send me the
>> >>>>> resulting dmesg of loading amdgpu (but don't start any UMD).
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Christian.
>> >>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> Tom
>> >>>>>>
>> >>>>>>>
>> >>>>>>> Christian.
>> >>>>>>>
>> >>>>>>> On 18.09.2018 at 15:27, Tom St Denis wrote:
>> >>>>>>>> This commit:
>> >>>>>>>>
>> >>>>>>>> [root at raven linux]# git bisect good
>> >>>>>>>> 9b0df0937a852d299fbe42a5939c9a8a4cc83c55 is the first bad commit
>> >>>>>>>> commit 9b0df0937a852d299fbe42a5939c9a8a4cc83c55
>> >>>>>>>> Author: Christian König <christian.koenig at amd.com>
>> >>>>>>>> Date:   Tue Sep 18 10:38:09 2018 +0200
>> >>>>>>>>
>> >>>>>>>>     drm/amdgpu: remove fence fallback
>> >>>>>>>>
>> >>>>>>>>     DC doesn't seem to have a fallback path either.
>> >>>>>>>>
>> >>>>>>>>     So when interrupts doesn't work any more we are pretty much busted no
>> >>>>>>>>     matter what.
>> >>>>>>>>
>> >>>>>>>>     Signed-off-by: Christian König <christian.koenig at amd.com>
>> >>>>>>>>     Reviewed-by: Chunming Zhou <david1.zhou at amd.com>
>> >>>>>>>>
>> >>>>>>>> Results in this:
>> >>>>>>>>
>> >>>>>>>> [   24.334025] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:07:00.0 on minor 1
>> >>>>>>>> [   24.335674] modprobe (3895) used greatest stack depth: 12600 bytes left
>> >>>>>>>> [   26.272358] [drm:gfx_v8_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out.
>> >>>>>>>> [   26.272460] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring 9 (-110).
>> >>>>>>>> [   26.407885] [drm:process_one_work] *ERROR* ib ring test failed (-110).
>> >>>>>>>> [   28.506708] fuse init (API version 7.27)
>> >>>>>>>>
>> >>>>>>>> On init with my polaris/raven1 system.
>> >>>>>>>>
>> >>>>>>>> Cheers,
>> >>>>>>>> Tom
>> >>>>>>>> _______________________________________________
>> >>>>>>>> amd-gfx mailing list
>> >>>>>>>> amd-gfx at lists.freedesktop.org
>> >>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >
>>
>
