[AMD Official Use Only]
The error message comes from the HIQ dequeue procedure, not from an HCQ, so no doorbell write is involved.
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477067] Call Trace:
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477072] dump_stack+0x7d/0x9c
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477651] hqd_destroy_v10_3+0x58/0x254 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477778] destroy_mqd+0x1e/0x30 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477884] kernel_queue_uninit+0xcf/0x100 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477985] pm_uninit+0x1a/0x30 [amdgpu]  # kernel_queue_uninit(pm->priv_queue, hanging); this priv_queue is the HIQ
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478127] stop_cpsch+0x98/0x100 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478242] kgd2kfd_suspend.part.0+0x32/0x50 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478338] kgd2kfd_suspend+0x1b/0x20 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478433] amdgpu_amdkfd_suspend+0x1e/0x30 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478529] amdgpu_device_fini_hw+0x182/0x335 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478655] amdgpu_driver_unload_kms+0x5c/0x80 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478732] amdgpu_pci_remove+0x27/0x40 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478806] pci_device_remove+0x3e/0xb0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478809] device_release_driver_internal+0x103/0x1d0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478813] driver_detach+0x4c/0x90
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478814] bus_remove_driver+0x5c/0xd0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478815] driver_unregister+0x31/0x50
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478817] pci_unregister_driver+0x40/0x90
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478818] amdgpu_exit+0x15/0x2d1 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478942] __x64_sys_delete_module+0x147/0x260
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478944] ? exit_to_user_mode_prepare+0x41/0x1d0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478946] ? ksys_write+0x67/0xe0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478948] do_syscall_64+0x40/0xb0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478951] entry_SYSCALL_64_after_hwframe+0x44/0xae
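To make the distinction concrete, here is a small self-contained C model of the behavior being described. All names and semantics here are illustrative assumptions, not actual amdgpu code: it only sketches why a direct CP register write is lost while the GFX core sits in gfxoff, whereas a doorbell write first wakes the core.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of gfxoff; the struct and function names are hypothetical,
 * not real amdgpu symbols. */
struct gfx_model {
	bool gfx_off;        /* GFX core currently power-gated */
	uint32_t hqd_active; /* stands in for mmCP_HQD_ACTIVE */
};

/* Doorbell writes are monitored even while the core is gated, so they
 * trigger a gfxoff exit before the work is consumed. */
static void ring_doorbell(struct gfx_model *gfx)
{
	gfx->gfx_off = false;
}

/* A plain MMIO dequeue request hits powered-down CP logic: the write is
 * dropped and the queue stays active (register reads would come back as
 * all-ones, matching the 0xffffffff dumps below). */
static bool mmio_dequeue_request(struct gfx_model *gfx)
{
	if (gfx->gfx_off)
		return false; /* request lost -> "preemption failed" */
	gfx->hqd_active = 0;  /* CP deactivates the queue */
	return true;
}
```

In this model, the rmmod path issues mmio_dequeue_request() while gfx_off is still true, which is why the preemption request gets no response.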
Regards,
Rico
From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Sent: Thursday, January 27, 2022 23:28
To: Yin, Tianci (Rico) <Tianci.Yin@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
Cc: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Chen, Guchun <Guchun.Chen@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: Fix an error message in rmmod

The hang you're seeing is the result of a command submission of an
UNMAP_QUEUES and QUERY_STATUS command to the HIQ. This is done using a
doorbell. KFD writes commands to the HIQ and rings a doorbell to wake up
the HWS (see kq_submit_packet in kfd_kernel_queue.c). Why does this
doorbell not trigger gfxoff exit during rmmod?

Regards,
  Felix

Am 2022-01-26 um 22:38 schrieb Yin, Tianci (Rico):
> [AMD Official Use Only]
>
> The rmmod ops has the prerequisites multi-user target and blacklisted
> amdgpu, which are IGT requirements so that IGT can make itself DRM
> master to test KMS.
> igt-gpu-tools/build/tests/amdgpu/amd_module_load --run-subtest reload
>
> From my understanding, the KFD process belongs to the regular way of
> gfxoff exit, in which a doorbell write triggers the gfxoff exit. For
> example, KFD maps an HCQ through a command on the HIQ or KIQ ring, or
> UMD submits jobs on an HCQ; both of these trigger doorbell writes
> (please refer to gfx_v10_0_ring_set_wptr_compute()).
>
> As to the IGT reload test, the dequeue request does not go through a
> command on a ring; it directly writes CP registers, so the GFX core
> remains in gfxoff.
>
> Thanks,
> Rico
>
> ------------------------------------------------------------------------
> *From:* Kuehling, Felix <Felix.Kuehling@xxxxxxx>
> *Sent:* Wednesday, January 26, 2022 23:08
> *To:* Yin, Tianci (Rico) <Tianci.Yin@xxxxxxx>; Wang, Yang(Kevin)
> <KevinYang.Wang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
> *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Chen, Guchun
> <Guchun.Chen@xxxxxxx>
> *Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
> My question is, why is this problem only seen during module unload? Why
> aren't we seeing HWS hangs due to GFX_OFF all the time in normal
> operations? For example when the GPU is idle and a new KFD process is
> started, creating a new runlist. Are we just getting lucky because the
> process first has to allocate some memory, which maybe makes some HW
> access (flushing TLBs etc.) that wakes up the GPU?
>
> Regards,
>   Felix
>
>
> Am 2022-01-26 um 01:43 schrieb Yin, Tianci (Rico):
> > [AMD Official Use Only]
> >
> > Thanks Kevin and Felix!
> >
> > In the gfxoff state, the dequeue request (a CP register write) can't
> > make gfxoff exit; the CP is actually powered off and the CP register
> > write is invalid. Doorbell register writes (the regular way), or
> > directly requesting the SMU to disable the GFX powergate (by invoking
> > amdgpu_gfx_off_ctrl), can trigger the gfxoff exit.
> >
> > I have also tried
> > amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_COMPUTE, false),
> > but it has no effect.
> >
> > [10386.162273] amdgpu: cp queue pipe 4 queue 0 preemption failed
> > [10671.225065] amdgpu: mmCP_HQD_ACTIVE : 0xffffffff
> > [10386.162290] amdgpu: mmCP_HQD_HQ_STATUS0 : 0xffffffff
> > [10386.162297] amdgpu: mmCP_STAT : 0xffffffff
> > [10386.162303] amdgpu: mmCP_BUSY_STAT : 0xffffffff
> > [10386.162308] amdgpu: mmRLC_STAT : 0xffffffff
> > [10386.162314] amdgpu: mmGRBM_STATUS : 0xffffffff
> > [10386.162320] amdgpu: mmGRBM_STATUS2 : 0xffffffff
> >
> > Thanks again!
> > Rico
> > ------------------------------------------------------------------------
> > *From:* Kuehling, Felix <Felix.Kuehling@xxxxxxx>
> > *Sent:* Tuesday, January 25, 2022 23:31
> > *To:* Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>; Yin, Tianci (Rico)
> > <Tianci.Yin@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
> > *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Chen, Guchun
> > <Guchun.Chen@xxxxxxx>
> > *Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
> > I have no objection to the change. It restores the sequence that was
> > used before e9669fb78262. But I don't understand why GFX_OFF is causing
> > a preemption error during module unload, but not when KFD is in normal
> > use. Maybe it's because of the compute power profile that's normally set
> > by amdgpu_amdkfd_set_compute_idle before we interact with the HWS.
> >
> > Either way, the patch is
> >
> > Acked-by: Felix Kuehling <Felix.Kuehling@xxxxxxx>
> >
> >
> > Am 2022-01-25 um 05:48 schrieb Wang, Yang(Kevin):
> > > [AMD Official Use Only]
> > >
> > > The issue was introduced in the following patch, so adding the
> > > following information is better:
> > > Fixes: e9669fb78262 ("drm/amdgpu: Add early fini callback")
> > >
> > > Reviewed-by: Yang Wang <kevinyang.wang@xxxxxxx>
> > >
> > > Best Regards,
> > > Kevin
> > >
> > > ------------------------------------------------------------------------
> > > *From:* amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> on behalf of
> > > Tianci Yin <tianci.yin@xxxxxxx>
> > > *Sent:* Tuesday, January 25, 2022 6:03 PM
> > > *To:* amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
> > > *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Yin, Tianci
> > > (Rico) <Tianci.Yin@xxxxxxx>; Chen, Guchun <Guchun.Chen@xxxxxxx>
> > > *Subject:* [PATCH] drm/amdgpu: Fix an error message in rmmod
> > > From: "Tianci.Yin" <tianci.yin@xxxxxxx>
> > >
> > > [why]
> > > In the rmmod procedure, KFD sends the CP a dequeue request, but the
> > > request gets no response; then the error message "cp queue pipe 4
> > > queue 0 preemption failed" is printed.
> > >
> > > [how]
> > > Performing the KFD suspend after disabling gfxoff fixes it.
> > >
> > > Change-Id: I0453f28820542d4a5ab26e38fb5b87ed76ce6930
> > > Signed-off-by: Tianci.Yin <tianci.yin@xxxxxxx>
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index b75d67f644e5..77e9837ba342 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -2720,11 +2720,11 @@ static int amdgpu_device_ip_fini_early(struct
> > > amdgpu_device *adev)
> > >                 }
> > >         }
> > >
> > > -       amdgpu_amdkfd_suspend(adev, false);
> > > -
> > >         amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
> > >         amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
> > >
> > > +       amdgpu_amdkfd_suspend(adev, false);
> > > +
> > >         /* Workaroud for ASICs need to disable SMC first */
> > >         amdgpu_device_smu_fini_early(adev);
> > >
> > > --
> > > 2.25.1
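The reordering in the quoted patch can be summarized in a short C sketch: KFD suspend preempts queues via direct CP register writes, so it only works once power gating has been ungated. The model below is purely illustrative; the struct and helper names are hypothetical stand-ins for the real amdgpu functions named in the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the fini-early ordering; not real driver code. */
struct dev_model {
	bool pg_gated; /* powergating (gfxoff) still enabled */
};

/* Stands in for amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE). */
static void set_pg_ungate(struct dev_model *dev)
{
	dev->pg_gated = false;
}

/* Stands in for amdgpu_amdkfd_suspend(): the HIQ dequeue request only
 * reaches the CP when the GFX core is not power-gated. */
static bool kfd_suspend(struct dev_model *dev)
{
	return !dev->pg_gated; /* false => "preemption failed" */
}

/* Old order: suspend KFD first, then ungate -> the dequeue is lost. */
static bool fini_early_old(struct dev_model *dev)
{
	bool ok = kfd_suspend(dev);
	set_pg_ungate(dev);
	return ok;
}

/* New order (this patch): ungate first, then suspend -> succeeds. */
static bool fini_early_new(struct dev_model *dev)
{
	set_pg_ungate(dev);
	return kfd_suspend(dev);
}
```

Under this model the old ordering always fails when the GPU is idle in gfxoff at rmmod time, while the new ordering always succeeds, which matches the observed fix.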