On Thu, 11 Mar 2021 at 10:09, Daniel Gomez <daniel@xxxxxxxx> wrote:
>
> On Wed, 10 Mar 2021 at 18:06, Alex Deucher <alexdeucher@xxxxxxxxx> wrote:
> >
> > On Wed, Mar 10, 2021 at 11:37 AM Daniel Gomez <daniel@xxxxxxxx> wrote:
> > >
> > > Disabling GFXOFF via the quirk list fixes a hardware lockup in
> > > Ryzen V1605B, RAVEN 0x1002:0x15DD rev 0x83.
> > >
> > > Signed-off-by: Daniel Gomez <daniel@xxxxxxxx>
> > > ---
> > >
> > > This patch is a continuation of the work here:
> > > https://lkml.org/lkml/2021/2/3/122 where a hardware lockup was discussed and
> > > a dma_fence deadlock was provoked as a side effect. To reproduce the issue
> > > please refer to the above link.
> > >
> > > The hardware lockup was introduced in 5.6-rc1 for our particular revision as it
> > > wasn't part of the new blacklist. Before that, in kernel v5.5, this hardware was
> > > working fine without any hardware lock because the GFXOFF was actually disabled
> > > by the if condition for the CHIP_RAVEN case. So this patch adds the 'Radeon
> > > Vega Mobile Series [1002:15dd] (rev 83)' to the blacklist to disable the GFXOFF.
> > >
> > > But besides the fix, I'd like to ask where this revision comes from. Is it
> > > an ASIC revision or is it hardcoded in the VBIOS from our vendor? From what I
> > > can see, it comes from the ASIC and I wonder if somehow we can get an APU in the
> > > future, 'not blacklisted', with the same problem. Then, should this table only
> > > filter for the vendor and device and not the revision? Do you know if there are
> > > any revisions for the 1002:15dd validated, tested and functional?
> >
> > The pci revision id (RID) is used to specify the specific SKU within a
> > family. GFXOFF is supposed to be working on all raven variants. It
> > was tested and functional on all reference platforms and any OEM
> > platforms that launched with Linux support.
> > There are a lot of dependencies on sbios in the early raven variants
> > (0x15dd), so it's likely more of a specific platform issue, but there is
> > not a good way to detect this so we use the DID/SSID/RID as a proxy. The
> > newer raven variants (0x15d8) have much better GFXOFF support since they
> > all shipped with newer firmware and sbios.
>
> We took one of the first reference platform boards to design our
> custom board based on the V1605B and I assume it has one of the early 'unstable'
> raven variants with RID 0x83. Also, as OEM we are in control of the bios
> (provided by insyde) but I wasn't sure about the RID, so thanks for the
> clarification. Is there anything we can do with the bios to have the GFXOFF
> enabled and 'stable' for this particular revision? Otherwise we'd need to add
> the 0x83 RID to the table. Also, there is an extra ']' in the patch
> subject. Sorry for that. Would you need a new patch in case you accept it
> with the ']' removed?
>
> Good to hear that the newer raven versions have better GFXOFF support.

Adding Alex Desnoyer to the loop as he is responsible for the
electronics/hardware and bios, so he can provide more information about this.

I've now done a test on the reference platform (dibbler) with the latest bios
available and the hw lockup can also be reproduced with the same steps.

For reference, I'm using mainline kernel 5.12-rc2.

[ 5.938544] [drm] initializing kernel modesetting (RAVEN 0x1002:0x15DD 0x1002:0x15DD 0xC1).
[ 5.939942] amdgpu: ATOM BIOS: 113-RAVEN-11

As in the previous cases, the clocks go to 100% of usage when the hang occurs.
However, when the gpu hangs, dmesg output displays the following:

[ 1568.279847] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=188, emitted seq=191
[ 1568.434084] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 311 thread Xorg:cs0 pid 312
[ 1568.279847] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=188, emitted seq=191
[ 1568.434084] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 311 thread Xorg:cs0 pid 312
[ 1568.507000] amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
[ 1628.491882] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1628.491882] rcu: 3-...!: (665 ticks this GP) idle=f9a/1/0x4000000000000000 softirq=188533/188533 fqs=15
[ 1628.491882] rcu: rcu_sched kthread timer wakeup didn't happen for 58497 jiffies! g726761 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 1628.491882] rcu: Possible timer handling issue on cpu=2 timer-softirq=55225
[ 1628.491882] rcu: rcu_sched kthread starved for 58500 jiffies! g726761 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
[ 1628.491882] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 1628.491882] rcu: RCU grace-period kthread stack dump:
[ 1628.491882] rcu: Stack dump where RCU GP kthread last ran:
[ 1808.518445] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1808.518445] rcu: 3-...!: (2643 ticks this GP) idle=f9a/1/0x4000000000000000 softirq=188533/188533 fqs=15
[ 1808.518445] rcu: rcu_sched kthread starved for 238526 jiffies! g726761 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=2
[ 1808.518445] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 1808.518445] rcu: RCU grace-period kthread stack dump:
[ 1808.518445] rcu: Stack dump where RCU GP kthread last ran:

>
> Daniel
>
> >
> > Alex
> >
> > >
> > > Logs:
> > > [ 27.708348] [drm] initializing kernel modesetting (RAVEN
> > > 0x1002:0x15DD 0x1002:0x15DD 0x83).
> > > [ 27.789156] amdgpu: ATOM BIOS: 113-RAVEN-115
> > >
> > > Thanks in advance,
> > > Daniel
> > >
> > >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > index 65db88bb6cbc..319d4b99aec8 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > @@ -1243,6 +1243,8 @@ static const struct amdgpu_gfxoff_quirk amdgpu_gfxoff_quirk_list[] = {
> > >         { 0x1002, 0x15dd, 0x103c, 0x83e7, 0xd3 },
> > >         /* GFXOFF is unstable on C6 parts with a VBIOS 113-RAVEN-114 */
> > >         { 0x1002, 0x15dd, 0x1002, 0x15dd, 0xc6 },
> > > +       /* GFXOFF provokes a hw lockup on 83 parts with a VBIOS 113-RAVEN-115 */
> > > +       { 0x1002, 0x15dd, 0x1002, 0x15dd, 0x83 },
> > >         { 0, 0, 0, 0, 0 },
> > >  };
> > >
> > > --
> > > 2.30.1
> > >
> > > _______________________________________________
> > > dri-devel mailing list
> > > dri-devel@xxxxxxxxxxxxxxxxxxxxx
> > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
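
For readers following the thread, here is a minimal userspace sketch of how a
quirk table shaped like the one in the patch can be consulted, using the
DID/SSID/RID tuple as a proxy for "this platform". It is not the driver's
actual code: the struct name, field names and the function
should_disable_gfxoff() are hypothetical, and the table holds only the three
entries visible in the diff context above (the list in gfx_v9_0.c may contain
more).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for one quirk entry. The field order follows the
 * initializers in the patch: vendor, device, subsystem vendor,
 * subsystem device, revision. Names are assumptions for illustration.
 */
struct gfxoff_quirk {
	uint16_t vendor;
	uint16_t device;
	uint16_t subsys_vendor;
	uint16_t subsys_device;
	uint8_t revision;
};

/* Only the entries visible in the diff, plus the all-zero terminator. */
static const struct gfxoff_quirk quirk_list[] = {
	{ 0x1002, 0x15dd, 0x103c, 0x83e7, 0xd3 },
	{ 0x1002, 0x15dd, 0x1002, 0x15dd, 0xc6 },
	{ 0x1002, 0x15dd, 0x1002, 0x15dd, 0x83 },
	{ 0, 0, 0, 0, 0 },
};

/*
 * Walk the table until the terminator (vendor == 0) and report whether
 * all five identifiers of the given device match some entry exactly.
 */
static bool should_disable_gfxoff(const struct gfxoff_quirk *dev)
{
	const struct gfxoff_quirk *q;

	for (q = quirk_list; q->vendor != 0; q++) {
		if (q->vendor == dev->vendor &&
		    q->device == dev->device &&
		    q->subsys_vendor == dev->subsys_vendor &&
		    q->subsys_device == dev->subsys_device &&
		    q->revision == dev->revision)
			return true;
	}
	return false;
}
```

With this sketch, a device reporting 0x1002:0x15DD 0x1002:0x15DD 0x83 matches
the new entry, while the reference board from the log above (same IDs, RID
0xC1) does not match any of the three entries. Because the match is exact on
all five fields, a table of this shape has to be extended whenever another
affected revision turns up.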