BTW, this also seems to be what breaks suspend/resume.

Andrey

On 09/21/2018 01:56 PM, Andrey Grodzovsky wrote:
>
> No worries, I will just revert locally until then to clear the extra
> errors during my investigation of current GPU reset status and issues.
>
> Andrey
>
> On 09/21/2018 01:53 PM, Christian König wrote:
>> I unfortunately don't have a Polaris to test this myself.
>>
>> But please give me time till Monday so that I can at least try one
>> more thing to fix it.
>>
>> Christian.
>>
>> On 21.09.2018 at 19:11, Andrey Grodzovsky wrote:
>>>
>>> Ping...
>>>
>>> Andrey
>>>
>>> On 09/20/2018 04:35 PM, Andrey Grodzovsky wrote:
>>>>
>>>> What's the status with this error and the suggested patch to fix it?
>>>> It impacts GPU reset on Polaris11.
>>>>
>>>> Do we want to investigate why the original patch breaks it or just
>>>> disable with the proposed patch?
>>>>
>>>> P.S. Suspend/resume also stopped working on the latest branch - will
>>>> bisect it later today or tomorrow.
>>>>
>>>> Andrey
>>>>
>>>> On 09/18/2018 11:00 AM, Christian König wrote:
>>>>> Tom,
>>>>>
>>>>> can you try if the following makes it working again?
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
>>>>> index b6160de70d12..d65f5ba92fc5 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
>>>>> @@ -937,6 +937,10 @@ static int gfx_v8_0_ring_test_ib(struct amdgpu_ring *ring, long timeout)
>>>>>        return r;
>>>>>  }
>>>>>
>>>>> +static int gfx_v8_0_kiq_ring_test_ib(struct amdgpu_ring *ring, long timeout)
>>>>> +{
>>>>> +      return 0;
>>>>> +}
>>>>>
>>>>>  static void gfx_v8_0_free_microcode(struct amdgpu_device *adev)
>>>>>  {
>>>>> @@ -7174,7 +7178,7 @@ static const struct amdgpu_ring_funcs gfx_v8_0_ring_funcs_kiq = {
>>>>>        .emit_ib = gfx_v8_0_ring_emit_ib_compute,
>>>>>        .emit_fence = gfx_v8_0_ring_emit_fence_kiq,
>>>>>        .test_ring = gfx_v8_0_ring_test_ring,
>>>>> -      .test_ib = gfx_v8_0_ring_test_ib,
>>>>> +      .test_ib = gfx_v8_0_kiq_ring_test_ib,
>>>>>        .insert_nop = amdgpu_ring_insert_nop,
>>>>>        .pad_ib = amdgpu_ring_generic_pad_ib,
>>>>>        .emit_rreg = gfx_v8_0_ring_emit_rreg,
>>>>>
>>>>> Thanks,
>>>>> Christian.
>>>>>
>>>>> On 18.09.2018 at 16:41, Christian König wrote:
>>>>>> CRTC and GFX interrupts seem to be working perfectly fine.
>>>>>>
>>>>>> The problem here looks like only EOP interrupts from the Compute
>>>>>> queue are not correctly handled.
>>>>>>
>>>>>> Most likely a bug somewhere in gfx_v8_0_eop_irq().
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>> On 18.09.2018 at 16:36, Deucher, Alexander wrote:
>>>>>>>
>>>>>>> FWIW, a number of consumer Raven boards have bad IVRS tables
>>>>>>> (Windows doesn't use interrupt remapping, so they are sometimes
>>>>>>> wrong and probably not validated). There are a number of
>>>>>>> workarounds to manually override the IVRS tables to make
>>>>>>> interrupts work. I think specifying pci=noacpi is also a
>>>>>>> possible workaround.
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>> ------------------------------------------------------------------------
>>>>>>> *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of Christian König <christian.koenig at amd.com>
>>>>>>> *Sent:* Tuesday, September 18, 2018 10:31:16 AM
>>>>>>> *To:* StDenis, Tom; amd-gfx mailing list; Zhou, David(ChunMing)
>>>>>>> *Subject:* Re: Regression on gfx8 with ring init
>>>>>>> Well looks like interrupt processing is working perfectly fine.
>>>>>>>
>>>>>>> But looking at the error message once more I see that this actually
>>>>>>> affects ring number 9 and not the GFX ring.
>>>>>>>
>>>>>>> Can you fix amdgpu_ib_ring_tests() to print ring->name instead of the
>>>>>>> number?
>>>>>>>
>>>>>>> That must be some of the compute rings.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Christian.
>>>>>>>
>>>>>>> On 18.09.2018 at 16:20, Tom St Denis wrote:
>>>>>>> > On 2018-09-18 10:13 a.m., Christian König wrote:
>>>>>>> >> Mhm, there is no more failed IB-test in there, is there?
>>>>>>> >
>>>>>>> > oh sorry I thought you wanted to test HEAD~ ... Attached is a log from
>>>>>>> > the tip of drm-next
>>>>>>> >
>>>>>>> > Tom
>>>>>>> >
>>>>>>> >>
>>>>>>> >> Christian.
>>>>>>> >>
>>>>>>> >> On 18.09.2018 at 16:09, Tom St Denis wrote:
>>>>>>> >>> Disabling IOMMU in the BIOS resulted in a correct boot up...
>>>>>>> >>>
>>>>>>> >>> Here's the log.
>>>>>>> >>>
>>>>>>> >>> Tom
>>>>>>> >>>
>>>>>>> >>> On 2018-09-18 9:58 a.m., Tom St Denis wrote:
>>>>>>> >>>> Odd I couldn't even boot my system with the dGPU as primary after
>>>>>>> >>>> rebuilding the kernel. It got hung up in the IOMMU driver (loads
>>>>>>> >>>> of AMD-Vi IOMMU errors) which I wasn't able to capture because it
>>>>>>> >>>> panic'ed before loading the network stack.
>>>>>>> >>>>
>>>>>>> >>>> Bizarre.
>>>>>>> >>>>
>>>>>>> >>>> I'll keep trying.
>>>>>>> >>>>
>>>>>>> >>>> Tom
>>>>>>> >>>>
>>>>>>> >>>> On 2018-09-18 9:35 a.m., Christian König wrote:
>>>>>>> >>>>> On 18.09.2018 at 15:32, Tom St Denis wrote:
>>>>>>> >>>>>> On 2018-09-18 9:30 a.m., Christian König wrote:
>>>>>>> >>>>>>> Great, not sure if that is good or bad news.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Anyway going to revert the change for now. Does anybody
>>>>>>> >>>>>>> volunteer to figure out why interrupts sometimes don't work
>>>>>>> >>>>>>> correctly on Raven?
>>>>>>> >>>>>>
>>>>>>> >>>>>> What does "doesn't work correctly" mean? My workstation is a Raven1
>>>>>>> >>>>>> (Ryzen 2400G) and other than the TTM bulk move issue has been
>>>>>>> >>>>>> perfectly stable (through suspend/resumes too I might add).
>>>>>>> >>>>>>
>>>>>>> >>>>>> Anything I could test with my devel raven?
>>>>>>> >>>>>
>>>>>>> >>>>> The problem seems to be that on some boards IH handling doesn't
>>>>>>> >>>>> work as it should.
>>>>>>> >>>>>
>>>>>>> >>>>> Can you try to disable the onboard graphics and try again?
>>>>>>> >>>>>
>>>>>>> >>>>> If that still doesn't work there is a DRM_DEBUG in
>>>>>>> >>>>> amdgpu_ih_process(), make that a DRM_ERROR and send me the
>>>>>>> >>>>> resulting dmesg of loading amdgpu (but don't start any UMD).
>>>>>>> >>>>>
>>>>>>> >>>>> Thanks,
>>>>>>> >>>>> Christian.
>>>>>>> >>>>>
>>>>>>> >>>>>>
>>>>>>> >>>>>> Tom
>>>>>>> >>>>>>
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Christian.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> On 18.09.2018 at 15:27, Tom St Denis wrote:
>>>>>>> >>>>>>>> This commit:
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> [root at raven linux]# git bisect good
>>>>>>> >>>>>>>> 9b0df0937a852d299fbe42a5939c9a8a4cc83c55 is the first bad commit
>>>>>>> >>>>>>>> commit 9b0df0937a852d299fbe42a5939c9a8a4cc83c55
>>>>>>> >>>>>>>> Author: Christian König <christian.koenig at amd.com>
>>>>>>> >>>>>>>> Date:   Tue Sep 18 10:38:09 2018 +0200
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>>     drm/amdgpu: remove fence fallback
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>>     DC doesn't seem to have a fallback path either.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>>     So when interrupts doesn't work any more we are pretty much busted no
>>>>>>> >>>>>>>>     matter what.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>>     Signed-off-by: Christian König <christian.koenig at amd.com>
>>>>>>> >>>>>>>>     Reviewed-by: Chunming Zhou <david1.zhou at amd.com>
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> Results in this:
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> [ 24.334025] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:07:00.0 on minor 1
>>>>>>> >>>>>>>> [ 24.335674] modprobe (3895) used greatest stack depth: 12600 bytes left
>>>>>>> >>>>>>>> [ 26.272358] [drm:gfx_v8_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out.
>>>>>>> >>>>>>>> [ 26.272460] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring 9 (-110).
>>>>>>> >>>>>>>> [ 26.407885] [drm:process_one_work] *ERROR* ib ring test failed (-110).
>>>>>>> >>>>>>>> [ 28.506708] fuse init (API version 7.27)
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> On init with my polaris/raven1 system.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> Cheers,
>>>>>>> >>>>>>>> Tom
>>>>>>> >>>>>>>> _______________________________________________
>>>>>>> >>>>>>>> amd-gfx mailing list
>>>>>>> >>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>> >>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx