Re: [PATCH]] drm/amdgpu/gfx9: add gfxoff quirk

Daniel Gomez <daniel@xxxxxxxx> · Thu, 11 Mar 2021 14:48:52 +0100

On Thu, 11 Mar 2021 at 10:09, Daniel Gomez <daniel@xxxxxxxx> wrote:
>
> On Wed, 10 Mar 2021 at 18:06, Alex Deucher <alexdeucher@xxxxxxxxx> wrote:
> >
> > On Wed, Mar 10, 2021 at 11:37 AM Daniel Gomez <daniel@xxxxxxxx> wrote:
> > >
> > > Disabling GFXOFF via the quirk list fixes a hardware lockup in
> > > Ryzen V1605B, RAVEN 0x1002:0x15DD rev 0x83.
> > >
> > > Signed-off-by: Daniel Gomez <daniel@xxxxxxxx>
> > > ---
> > >
> > > This patch is a continuation of the work here:
> > > https://lkml.org/lkml/2021/2/3/122 where a hardware lockup was discussed and
> > > a dma_fence deadlock was provoke as a side effect. To reproduce the issue
> > > please refer to the above link.
> > >
> > > The hardware lockup was introduced in 5.6-rc1 for our particular revision as it
> > > wasn't part of the new blacklist. Before that, in kernel v5.5, this hardware was
> > > working fine without any hardware lock because the GFXOFF was actually disabled
> > > by the if condition for the CHIP_RAVEN case. So this patch, adds the 'Radeon
> > > Vega Mobile Series [1002:15dd] (rev 83)' to the blacklist to disable the GFXOFF.
> > >
> > > But besides the fix, I'd like to ask from where this revision comes from. Is it
> > > an ASIC revision or is it hardcoded in the VBIOS from our vendor? From what I
> > > can see, it comes from the ASIC and I wonder if somehow we can get an APU in the
> > > future, 'not blacklisted', with the same problem. Then, should this table only
> > > filter for the vendor and device and not the revision? Do you know if there are
> > > any revisions for the 1002:15dd validated, tested and functional?
> >
> > The pci revision id (RID) is used to specify the specific SKU within a
> > family.  GFXOFF is supposed to be working on all raven variants.  It
> > was tested and functional on all reference platforms and any OEM
> > platforms that launched with Linux support.  There are a lot of
> > dependencies on sbios in the early raven variants (0x15dd), so it's
> > likely more of a specific platform issue, but there is not a good way
> > to detect this so we use the DID/SSID/RID as a proxy.  The newer raven
> > variants (0x15d8) have much better GFXOFF support since they all
> > shipped with newer firmware and sbios.
>
> We took one of the first reference platform boards to design our
> custom board based on the V1605B and I assume it has one of the early 'unstable'
> raven variants with RID 0x83. Also, as OEM we are in control of the bios
> (provided by insyde) but I wasn't sure about the RID so, thanks for the
> clarification. Is there anything we can do with the bios to have the GFXOFF
> enabled and 'stable' for this particular revision? Otherwise we'd need to add
> the 0x83 RID to the table. Also, there is an extra ']' in the patch
> subject. Sorry
> for that. Would you need a new patch in case you accept it with the ']' removed?
>
> Good to hear that the newer raven versions have better GFXOFF support.

Adding Alex Desnoyer to the loop as he is the electronic/hardware and
bios responsible so, he can
provide more information about this.

I've now done a test on the reference platform (dibbler) with the
latest bios available
and the hw lockup can be also reproduced with the same steps.

For reference, I'm using mainline kernel 5.12-rc2.

[    5.938544] [drm] initializing kernel modesetting (RAVEN
0x1002:0x15DD 0x1002:0x15DD 0xC1).
[    5.939942] amdgpu: ATOM BIOS: 113-RAVEN-11

As in the previous cases, the clocks go to 100% of usage when the hang occurs.

However, when the gpu hangs, dmesg output displays the following:

[ 1568.279847] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx
timeout, signaled seq=188, emitted seq=191
[ 1568.434084] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
information: process Xorg pid 311 thread Xorg:cs0 pid 312
[ 1568.279847] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx
timeout, signaled seq=188, emitted seq=191
[ 1568.434084] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
information: process Xorg pid 311 thread Xorg:cs0 pid 312
[ 1568.507000] amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
[ 1628.491882] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1628.491882] rcu:     3-...!: (665 ticks this GP)
idle=f9a/1/0x4000000000000000 softirq=188533/188533 fqs=15
[ 1628.491882] rcu: rcu_sched kthread timer wakeup didn't happen for
58497 jiffies! g726761 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 1628.491882] rcu:     Possible timer handling issue on cpu=2
timer-softirq=55225
[ 1628.491882] rcu: rcu_sched kthread starved for 58500 jiffies!
g726761 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
[ 1628.491882] rcu:     Unless rcu_sched kthread gets sufficient CPU
time, OOM is now expected behavior.
[ 1628.491882] rcu: RCU grace-period kthread stack dump:
[ 1628.491882] rcu: Stack dump where RCU GP kthread last ran:
[ 1808.518445] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1808.518445] rcu:     3-...!: (2643 ticks this GP)
idle=f9a/1/0x4000000000000000 softirq=188533/188533 fqs=15
[ 1808.518445] rcu: rcu_sched kthread starved for 238526 jiffies!
g726761 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=2
[ 1808.518445] rcu:     Unless rcu_sched kthread gets sufficient CPU
time, OOM is now expected behavior.
[ 1808.518445] rcu: RCU grace-period kthread stack dump:
[ 1808.518445] rcu: Stack dump where RCU GP kthread last ran:

>
> Daniel
>
> >
> > Alex
> >
> >
> > >
> > > Logs:
> > > [   27.708348] [drm] initializing kernel modesetting (RAVEN
> > > 0x1002:0x15DD 0x1002:0x15DD 0x83).
> > > [   27.789156] amdgpu: ATOM BIOS: 113-RAVEN-115
> > >
> > > Thanks in advance,
> > > Daniel
> > >
> > >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > index 65db88bb6cbc..319d4b99aec8 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > @@ -1243,6 +1243,8 @@ static const struct amdgpu_gfxoff_quirk amdgpu_gfxoff_quirk_list[] = {
> > >         { 0x1002, 0x15dd, 0x103c, 0x83e7, 0xd3 },
> > >         /* GFXOFF is unstable on C6 parts with a VBIOS 113-RAVEN-114 */
> > >         { 0x1002, 0x15dd, 0x1002, 0x15dd, 0xc6 },
> > > +       /* GFXOFF provokes a hw lockup on 83 parts with a VBIOS 113-RAVEN-115 */
> > > +       { 0x1002, 0x15dd, 0x1002, 0x15dd, 0x83 },
> > >         { 0, 0, 0, 0, 0 },
> > >  };
> > >
> > > --
> > > 2.30.1
> > >
> > > _______________________________________________
> > > dri-devel mailing list
> > > dri-devel@xxxxxxxxxxxxxxxxxxxxx
> > > https://lists.freedesktop.org/mailman/listinfo/dri-devel