Re: [PATCH] drm/amdgpu: Fix an error message in rmmod

I see, thanks for clarifying. So this is happening because we unmap the HIQ with direct MMIO register writes instead of using the KIQ.


I'm OK with this patch as a workaround, but as a proper fix, we should probably add a hiq_hqd_destroy function that uses KIQ, similar to how we have hiq_mqd_load functions that use KIQ to map the HIQ.


Regards,
  Felix
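
The failure mode discussed in this thread can be illustrated with a toy model. This is only a sketch under stated assumptions: every name below (toy_gpu, toy_mmio_dequeue, toy_doorbell_dequeue and so on) is hypothetical and does not exist in amdgpu. It assumes, as described further down the thread, that register accesses to a powered-off GFX block are dropped or read back as 0xffffffff, while a doorbell write wakes the block first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: none of these names exist in amdgpu. */
struct toy_gpu {
    bool gfx_powered; /* false while the GFX block is in gfxoff */
    bool hiq_active;  /* stands in for the HQD "active" state */
};

/* Reads from a powered-off block return all ones, as in the quoted log. */
static uint32_t toy_read_hqd_active(const struct toy_gpu *gpu)
{
    assert(gpu);
    if (!gpu->gfx_powered)
        return 0xffffffffu;
    return gpu->hiq_active ? 1u : 0u;
}

/* Direct register write: has no effect while the block is powered off. */
static void toy_mmio_dequeue(struct toy_gpu *gpu)
{
    assert(gpu);
    if (gpu->gfx_powered)
        gpu->hiq_active = false;
}

/* Doorbell path: the doorbell write wakes the block before the dequeue. */
static void toy_doorbell_dequeue(struct toy_gpu *gpu)
{
    assert(gpu);
    gpu->gfx_powered = true;
    gpu->hiq_active = false;
}

/* True when the queue is observed inactive after the request. */
static bool toy_dequeue_succeeded(const struct toy_gpu *gpu)
{
    return toy_read_hqd_active(gpu) == 0u;
}
```

In this model, calling toy_mmio_dequeue() on a GPU that is in gfxoff leaves the queue state untouched and the status read returns all ones, which corresponds to the "preemption failed" case. toy_doorbell_dequeue() stands in for any ring-based path, such as the KIQ-based one suggested above.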



On 2022-01-27 at 21:34, Yin, Tianci (Rico) wrote:

[AMD Official Use Only]


The error message comes from the HIQ dequeue procedure, not from an HCQ, so no doorbell is written.

Jan 25 16:10:58 lnx-ci-node kernel: [18161.477067] Call Trace:
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477072]  dump_stack+0x7d/0x9c
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477651]  hqd_destroy_v10_3+0x58/0x254 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477778]  destroy_mqd+0x1e/0x30 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477884]  kernel_queue_uninit+0xcf/0x100 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477985]  pm_uninit+0x1a/0x30 [amdgpu] #kernel_queue_uninit(pm->priv_queue, hanging); this priv_queue == HIQ
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478127]  stop_cpsch+0x98/0x100 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478242]  kgd2kfd_suspend.part.0+0x32/0x50 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478338]  kgd2kfd_suspend+0x1b/0x20 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478433]  amdgpu_amdkfd_suspend+0x1e/0x30 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478529]  amdgpu_device_fini_hw+0x182/0x335 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478655]  amdgpu_driver_unload_kms+0x5c/0x80 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478732]  amdgpu_pci_remove+0x27/0x40 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478806]  pci_device_remove+0x3e/0xb0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478809]  device_release_driver_internal+0x103/0x1d0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478813]  driver_detach+0x4c/0x90
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478814]  bus_remove_driver+0x5c/0xd0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478815]  driver_unregister+0x31/0x50
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478817]  pci_unregister_driver+0x40/0x90
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478818]  amdgpu_exit+0x15/0x2d1 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478942]  __x64_sys_delete_module+0x147/0x260
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478944]  ? exit_to_user_mode_prepare+0x41/0x1d0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478946]  ? ksys_write+0x67/0xe0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478948]  do_syscall_64+0x40/0xb0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478951]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Regards,
Rico
------------------------------------------------------------------------
*From:* Kuehling, Felix <Felix.Kuehling@xxxxxxx>
*Sent:* Thursday, January 27, 2022 23:28
*To:* Yin, Tianci (Rico) <Tianci.Yin@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
*Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Chen, Guchun <Guchun.Chen@xxxxxxx>
*Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
The hang you're seeing is the result of a command submission of an
UNMAP_QUEUES and QUERY_STATUS command to the HIQ. This is done using a
doorbell. KFD writes commands to the HIQ and rings a doorbell to wake up
the HWS (see kq_submit_packet in kfd_kernel_queue.c). Why does this
doorbell not trigger gfxoff exit during rmmod?


Regards,
   Felix



On 2022-01-26 at 22:38, Yin, Tianci (Rico) wrote:
>
> [AMD Official Use Only]
>
>
> The rmmod operation has two prerequisites: booting to the multi-user
> target and blacklisting amdgpu. Both are IGT requirements, so that IGT
> can make itself DRM master to test KMS. The test is run as follows:
> igt-gpu-tools/build/tests/amdgpu/amd_module_load --run-subtest reload
>
> As I understand it, a KFD process uses the regular gfxoff exit path, in
> which a doorbell write triggers the gfxoff exit. For example, when KFD
> maps an HCQ through a command on the HIQ or KIQ ring, or when the UMD
> submits jobs on an HCQ, both cases trigger a doorbell write (please refer
> to gfx_v10_0_ring_set_wptr_compute()).
>
> In the IGT reload test, the dequeue request does not go through a command
> on a ring. It writes the CP registers directly, so the GFX core remains
> in gfxoff.
>
> Thanks,
> Rico
>
> ------------------------------------------------------------------------
> *From:* Kuehling, Felix <Felix.Kuehling@xxxxxxx>
> *Sent:* Wednesday, January 26, 2022 23:08
> *To:* Yin, Tianci (Rico) <Tianci.Yin@xxxxxxx>; Wang, Yang(Kevin)
> <KevinYang.Wang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
> *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Chen, Guchun
> <Guchun.Chen@xxxxxxx>
> *Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
> My question is, why is this problem only seen during module unload? Why
> aren't we seeing HWS hangs due to GFX_OFF all the time in normal
> operations? For example when the GPU is idle and a new KFD process is
> started, creating a new runlist. Are we just getting lucky because the
> process first has to allocate some memory, which maybe makes some HW
> access (flushing TLBs etc.) that wakes up the GPU?
>
>
> Regards,
>    Felix
>
>
>
> On 2022-01-26 at 01:43, Yin, Tianci (Rico) wrote:
> >
> > [AMD Official Use Only]
> >
> >
> > Thanks Kevin and Felix!
> >
> > In the gfxoff state, the dequeue request (made by writing CP registers)
> > cannot trigger a gfxoff exit. The CP is actually powered off, so the CP
> > register writes are invalid. Either writing the doorbell registers (the
> > regular way) or directly requesting the SMU to disable GFX powergating
> > (by invoking amdgpu_gfx_off_ctrl) can trigger a gfxoff exit.
> >
> > I have also tried
> > amdgpu_dpm_switch_power_profile(adev,PP_SMC_POWER_PROFILE_COMPUTE,false),
> > but it has no effect.
> >
> > [10386.162273] amdgpu: cp queue pipe 4 queue 0 preemption failed
> > [10671.225065] amdgpu: mmCP_HQD_ACTIVE : 0xffffffff
> > [10386.162290] amdgpu: mmCP_HQD_HQ_STATUS0 : 0xffffffff
> > [10386.162297] amdgpu: mmCP_STAT : 0xffffffff
> > [10386.162303] amdgpu: mmCP_BUSY_STAT : 0xffffffff
> > [10386.162308] amdgpu: mmRLC_STAT : 0xffffffff
> > [10386.162314] amdgpu: mmGRBM_STATUS : 0xffffffff
> > [10386.162320] amdgpu: mmGRBM_STATUS2: 0xffffffff
> >
> > Thanks again!
> > Rico
> > ------------------------------------------------------------------------
> > *From:* Kuehling, Felix <Felix.Kuehling@xxxxxxx>
> > *Sent:* Tuesday, January 25, 2022 23:31
> > *To:* Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>; Yin, Tianci (Rico)
> > <Tianci.Yin@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
> > *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Chen, Guchun
> > <Guchun.Chen@xxxxxxx>
> > *Subject:* Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
> > I have no objection to the change. It restores the sequence that was
> > used before e9669fb78262. But I don't understand why GFX_OFF is causing
> > a preemption error during module unload, but not when KFD is in normal
> > use. Maybe it's because of the compute power profile that's normally set
> > by amdgpu_amdkfd_set_compute_idle before we interact with the HWS.
> >
> >
> > Either way, the patch is
> >
> > Acked-by: Felix Kuehling <Felix.Kuehling@xxxxxxx>
> >
> >
> >
> > On 2022-01-25 at 05:48, Wang, Yang(Kevin) wrote:
> > >
> > > [AMD Official Use Only]
> > >
> > >
> > > The issue was introduced by the following patch, so it would be better
> > > to add this information:
> > > Fixes: e9669fb78262 ("drm/amdgpu: Add early fini callback")
> > >
> > > Reviewed-by: Yang Wang <kevinyang.wang@xxxxxxx>
> > >
> > > Best Regards,
> > > Kevin
> > >
> > >
> > > ------------------------------------------------------------------------
> > > *From:* amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> on behalf of
> > > Tianci Yin <tianci.yin@xxxxxxx>
> > > *Sent:* Tuesday, January 25, 2022 6:03 PM
> > > *To:* amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
> > > *Cc:* Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Yin, Tianci
> > > (Rico) <Tianci.Yin@xxxxxxx>; Chen, Guchun <Guchun.Chen@xxxxxxx>
> > > *Subject:* [PATCH] drm/amdgpu: Fix an error message in rmmod
> > > From: "Tianci.Yin" <tianci.yin@xxxxxxx>
> > >
> > > [why]
> > > During rmmod, KFD sends the CP a dequeue request, but the request
> > > gets no response, and the error message "cp queue pipe 4 queue 0
> > > preemption failed" is printed.
> > >
> > > [how]
> > > Suspending KFD after disabling gfxoff fixes it.
> > >
> > > Change-Id: I0453f28820542d4a5ab26e38fb5b87ed76ce6930
> > > Signed-off-by: Tianci.Yin <tianci.yin@xxxxxxx>
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index b75d67f644e5..77e9837ba342 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -2720,11 +2720,11 @@ static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
> > >                  }
> > >          }
> > >
> > > -       amdgpu_amdkfd_suspend(adev, false);
> > > -
> > >          amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
> > >          amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
> > >
> > > +       amdgpu_amdkfd_suspend(adev, false);
> > > +
> > >          /* Workaroud for ASICs need to disable SMC first */
> > >          amdgpu_device_smu_fini_early(adev);
> > >
> > > --
> > > 2.25.1
> > >
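
The reordering in the patch above can be illustrated with a toy model. This is a sketch under stated assumptions, not the driver code: all names (toy_dev, toy_set_pg_ungate, toy_kfd_suspend, toy_fini_early_old, toy_fini_early_new) are hypothetical stand-ins, and the model takes the worst case in which the GFX block is in gfxoff whenever power gating has not yet been ungated.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these are stand-ins, not amdgpu functions. */
struct toy_dev {
    bool pg_ungated;        /* true once power gating has been disabled */
    bool preemption_failed; /* set if the dequeue hit a powered-off CP */
};

/* Stand-in for amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE). */
static void toy_set_pg_ungate(struct toy_dev *dev)
{
    assert(dev);
    dev->pg_ungated = true;
}

/* Stand-in for amdgpu_amdkfd_suspend(): dequeues the HIQ via registers. */
static void toy_kfd_suspend(struct toy_dev *dev)
{
    assert(dev);
    /* Assumption: without ungating, the CP may be powered off (gfxoff). */
    if (!dev->pg_ungated)
        dev->preemption_failed = true;
}

/* Order before the patch: suspend KFD first, then ungate. */
static bool toy_fini_early_old(struct toy_dev *dev)
{
    toy_kfd_suspend(dev);
    toy_set_pg_ungate(dev);
    return !dev->preemption_failed;
}

/* Order after the patch: ungate first, then suspend KFD. */
static bool toy_fini_early_new(struct toy_dev *dev)
{
    toy_set_pg_ungate(dev);
    toy_kfd_suspend(dev);
    return !dev->preemption_failed;
}
```

In this model, toy_fini_early_old() reports a failed preemption because the dequeue runs while the block may still be gated, and toy_fini_early_new() does not. The real driver also ungates clock gating in the same place, which the model leaves out.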


