[AMD Official Use Only]
The error message comes from the HIQ dequeue procedure, not from an HCQ, so no doorbell write is involved.
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477067] Call Trace:
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477072] dump_stack+0x7d/0x9c
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477651] hqd_destroy_v10_3+0x58/0x254 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477778] destroy_mqd+0x1e/0x30 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477884] kernel_queue_uninit+0xcf/0x100 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477985] pm_uninit+0x1a/0x30 [amdgpu]  # kernel_queue_uninit(pm->priv_queue, hanging); this priv_queue is the HIQ
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478127] stop_cpsch+0x98/0x100 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478242] kgd2kfd_suspend.part.0+0x32/0x50 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478338] kgd2kfd_suspend+0x1b/0x20 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478433] amdgpu_amdkfd_suspend+0x1e/0x30 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478529] amdgpu_device_fini_hw+0x182/0x335 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478655] amdgpu_driver_unload_kms+0x5c/0x80 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478732] amdgpu_pci_remove+0x27/0x40 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478806] pci_device_remove+0x3e/0xb0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478809] device_release_driver_internal+0x103/0x1d0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478813] driver_detach+0x4c/0x90
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478814] bus_remove_driver+0x5c/0xd0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478815] driver_unregister+0x31/0x50
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478817] pci_unregister_driver+0x40/0x90
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478818] amdgpu_exit+0x15/0x2d1 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478942] __x64_sys_delete_module+0x147/0x260
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478944] ? exit_to_user_mode_prepare+0x41/0x1d0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478946] ? ksys_write+0x67/0xe0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478948] do_syscall_64+0x40/0xb0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478951] entry_SYSCALL_64_after_hwframe+0x44/0xae
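To make the distinction concrete, here is a small self-contained C model of the behavior being described. All names and semantics here are illustrative assumptions, not actual amdgpu code: it only sketches why a direct CP register write is lost while the GFX core sits in gfxoff, whereas a doorbell write first wakes the core.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of gfxoff; the struct and function names are hypothetical,
 * not real amdgpu symbols. */
struct gfx_model {
	bool gfx_off;        /* GFX core currently power-gated */
	uint32_t hqd_active; /* stands in for mmCP_HQD_ACTIVE */
};

/* Doorbell writes are monitored even while the core is gated, so they
 * trigger a gfxoff exit before the work is consumed. */
static void ring_doorbell(struct gfx_model *gfx)
{
	gfx->gfx_off = false;
}

/* A plain MMIO dequeue request hits powered-down CP logic: the write is
 * dropped and the queue stays active (register reads would come back as
 * all-ones, matching the 0xffffffff dumps below). */
static bool mmio_dequeue_request(struct gfx_model *gfx)
{
	if (gfx->gfx_off)
		return false; /* request lost -> "preemption failed" */
	gfx->hqd_active = 0;  /* CP deactivates the queue */
	return true;
}
```

In this model, the rmmod path issues mmio_dequeue_request() while gfx_off is still true, which is why the preemption request gets no response.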
Regards,
Rico
From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Sent: Thursday, January 27, 2022 23:28
To: Yin, Tianci (Rico) <Tianci.Yin@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
Cc: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Chen, Guchun <Guchun.Chen@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: Fix an error message in rmmod

The hang you're seeing is the result of a command submission of an
UNMAP_QUEUES and QUERY_STATUS command to the HIQ. This is done using a
doorbell. KFD writes commands to the HIQ and rings a doorbell to wake up
the HWS (see kq_submit_packet in kfd_kernel_queue.c). Why does this
doorbell not trigger gfxoff exit during rmmod?

Regards,
  Felix

Am 2022-01-26 um 22:38 schrieb Yin, Tianci (Rico):
> [AMD Official Use Only]
>
> The rmmod ops has the prerequisites multi-user target and blacklisted
> amdgpu, which are IGT requirements so that IGT can make itself DRM
> master to test KMS.
> igt-gpu-tools/build/tests/amdgpu/amd_module_load --run-subtest reload
>
> From my understanding, the KFD process belongs to the regular way of
> gfxoff exit, in which a doorbell write triggers the gfxoff exit. For
> example, KFD maps an HCQ through a command on the HIQ or KIQ ring, or
> UMD submits jobs on an HCQ; both of these trigger doorbell writes
> (please refer to gfx_v10_0_ring_set_wptr_compute()).
>
> As to the IGT reload test, the dequeue request does not go through a
> command on a ring; it directly writes CP registers, so the GFX core
> remains in gfxoff.
>
> Thanks,
> Rico
>
> ------------------------------------------------------------------------
> *From:* Kuehling, Felix <Felix.Kuehling@xxxxxxx>
> *Sent:* Wednesday, January 26, 2022 23:08
> *To:* Yin, Tianci (Rico) <Tianci.Yin@xxxxxxx>; Wang, Yang(Kevin)
> <KevinYang.Wang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
> *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Chen, Guchun
> <Guchun.Chen@xxxxxxx>
> *Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
> My question is, why is this problem only seen during module unload? Why
> aren't we seeing HWS hangs due to GFX_OFF all the time in normal
> operations? For example when the GPU is idle and a new KFD process is
> started, creating a new runlist. Are we just getting lucky because the
> process first has to allocate some memory, which maybe makes some HW
> access (flushing TLBs etc.) that wakes up the GPU?
>
> Regards,
>   Felix
>
>
> Am 2022-01-26 um 01:43 schrieb Yin, Tianci (Rico):
> > [AMD Official Use Only]
> >
> > Thanks Kevin and Felix!
> >
> > In the gfxoff state, the dequeue request (a CP register write) can't
> > make gfxoff exit; the CP is actually powered off and the CP register
> > write is invalid. Doorbell register writes (the regular way), or
> > directly requesting the SMU to disable the GFX powergate (by invoking
> > amdgpu_gfx_off_ctrl), can trigger the gfxoff exit.
> >
> > I have also tried
> > amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_COMPUTE, false),
> > but it has no effect.
> >
> > [10386.162273] amdgpu: cp queue pipe 4 queue 0 preemption failed
> > [10671.225065] amdgpu: mmCP_HQD_ACTIVE : 0xffffffff
> > [10386.162290] amdgpu: mmCP_HQD_HQ_STATUS0 : 0xffffffff
> > [10386.162297] amdgpu: mmCP_STAT : 0xffffffff
> > [10386.162303] amdgpu: mmCP_BUSY_STAT : 0xffffffff
> > [10386.162308] amdgpu: mmRLC_STAT : 0xffffffff
> > [10386.162314] amdgpu: mmGRBM_STATUS : 0xffffffff
> > [10386.162320] amdgpu: mmGRBM_STATUS2 : 0xffffffff
> >
> > Thanks again!
> > Rico
> > ------------------------------------------------------------------------
> > *From:* Kuehling, Felix <Felix.Kuehling@xxxxxxx>
> > *Sent:* Tuesday, January 25, 2022 23:31
> > *To:* Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>; Yin, Tianci (Rico)
> > <Tianci.Yin@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
> > *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Chen, Guchun
> > <Guchun.Chen@xxxxxxx>
> > *Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
> > I have no objection to the change. It restores the sequence that was
> > used before e9669fb78262. But I don't understand why GFX_OFF is causing
> > a preemption error during module unload, but not when KFD is in normal
> > use. Maybe it's because of the compute power profile that's normally set
> > by amdgpu_amdkfd_set_compute_idle before we interact with the HWS.
> >
> > Either way, the patch is
> >
> > Acked-by: Felix Kuehling <Felix.Kuehling@xxxxxxx>
> >
> >
> > Am 2022-01-25 um 05:48 schrieb Wang, Yang(Kevin):
> > > [AMD Official Use Only]
> > >
> > > The issue was introduced in the following patch, so adding the
> > > following information is better:
> > > Fixes: e9669fb78262 ("drm/amdgpu: Add early fini callback")
> > >
> > > Reviewed-by: Yang Wang <kevinyang.wang@xxxxxxx>
> > >
> > > Best Regards,
> > > Kevin
> > >
> > > ------------------------------------------------------------------------
> > > *From:* amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> on behalf of
> > > Tianci Yin <tianci.yin@xxxxxxx>
> > > *Sent:* Tuesday, January 25, 2022 6:03 PM
> > > *To:* amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
> > > *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Yin, Tianci
> > > (Rico) <Tianci.Yin@xxxxxxx>; Chen, Guchun <Guchun.Chen@xxxxxxx>
> > > *Subject:* [PATCH] drm/amdgpu: Fix an error message in rmmod
> > > From: "Tianci.Yin" <tianci.yin@xxxxxxx>
> > >
> > > [why]
> > > In the rmmod procedure, KFD sends the CP a dequeue request, but the
> > > request gets no response; then the error message "cp queue pipe 4
> > > queue 0 preemption failed" is printed.
> > >
> > > [how]
> > > Performing the KFD suspend after disabling gfxoff fixes it.
> > >
> > > Change-Id: I0453f28820542d4a5ab26e38fb5b87ed76ce6930
> > > Signed-off-by: Tianci.Yin <tianci.yin@xxxxxxx>
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index b75d67f644e5..77e9837ba342 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -2720,11 +2720,11 @@ static int amdgpu_device_ip_fini_early(struct
> > > amdgpu_device *adev)
> > >                 }
> > >         }
> > >
> > > -       amdgpu_amdkfd_suspend(adev, false);
> > > -
> > >         amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
> > >         amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
> > >
> > > +       amdgpu_amdkfd_suspend(adev, false);
> > > +
> > >         /* Workaroud for ASICs need to disable SMC first */
> > >         amdgpu_device_smu_fini_early(adev);
> > >
> > > --
> > > 2.25.1
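The reordering in the quoted patch can be summarized in a short C sketch: KFD suspend preempts queues via direct CP register writes, so it only works once power gating has been ungated. The model below is purely illustrative; the struct and helper names are hypothetical stand-ins for the real amdgpu functions named in the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the fini-early ordering; not real driver code. */
struct dev_model {
	bool pg_gated; /* powergating (gfxoff) still enabled */
};

/* Stands in for amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE). */
static void set_pg_ungate(struct dev_model *dev)
{
	dev->pg_gated = false;
}

/* Stands in for amdgpu_amdkfd_suspend(): the HIQ dequeue request only
 * reaches the CP when the GFX core is not power-gated. */
static bool kfd_suspend(struct dev_model *dev)
{
	return !dev->pg_gated; /* false => "preemption failed" */
}

/* Old order: suspend KFD first, then ungate -> the dequeue is lost. */
static bool fini_early_old(struct dev_model *dev)
{
	bool ok = kfd_suspend(dev);
	set_pg_ungate(dev);
	return ok;
}

/* New order (this patch): ungate first, then suspend -> succeeds. */
static bool fini_early_new(struct dev_model *dev)
{
	set_pg_ungate(dev);
	return kfd_suspend(dev);
}
```

Under this model the old ordering always fails when the GPU is idle in gfxoff at rmmod time, while the new ordering always succeeds, which matches the observed fix.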