On 29.03.2023 21:45, Dmitry Baryshkov wrote: > On Wed, 29 Mar 2023 at 18:48, Konrad Dybcio <konrad.dybcio@xxxxxxxxxx> wrote: >> >> >> >> On 29.03.2023 16:37, Johan Hovold wrote: >>> On Wed, Mar 29, 2023 at 04:04:44PM +0200, Konrad Dybcio wrote: >>>> If we fail to initialize the GPU for whatever reason (say we don't >>>> embed the GPU firmware files in the initrd), the error path involves >>>> pm_runtime_put_sync() which then calls idle() instead of suspend(). >>>> >>>> This is suboptimal, as it means that we're not going through the >>>> clean shutdown sequence. With at least A619_holi, this makes the GPU >>>> not wake up until it goes through at least one more start-fail-stop >>>> cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean >>>> shutdown. >>> >>> This does not sound right. If pm_runtime_put_sync() fails to suspend the >>> device when the usage count drops to zero, then you have a bug somewhere >>> else. >> I was surprised to see that it was not called as well, but I wasn't able >> to track it down before.. > > Could you please check that it's autosuspend who kicks in? In other > words, if we disable autosuspend, the pm_runtime_put_sync is enough()? > > That would probably mean that we lack some kind of reset in the hw_init path. > > On the other hand, I do not know how the device will react to the > error-in-the-middle state. Modems for example, can enter the state > where you can not properly turn it off once it starts the boot > process. > > And if we remember the efforts that Akhil has put into making sure > that the GPU is properly reset in case of an _error_, it might be > nearly impossible to shut it down in a proper way. > > Thus said, I think that unless there is an obvious way to restart the > init process, Korad's pm_runtime_put_sync_suspend() looks like a > correct fix to me. On the GPU side, when you cut GX and CX GDSCs, the hardware is off. Some clock / gdsc logic may be retained, but the GPU itself gets cut off. Parking the clocks and shuttting down VDD_GX (if exists) only makes that stronger. > >>> Also since commit 2c087a336676 ("drm/msm/adreno: Load the firmware >>> before bringing up the hardware") the firmware is loaded before even >>> hitting these paths so the above description does not sound right in >>> that respect either (or is missing some details). >> ..but I did some more digging and I found that the precise "firmware" >> that fails is the ZAP blob, which is not checked like SQE in the >> commit you mentioned! >> >> Now I don't think that we can easily check for it as-is since >> zap_shader_load_mdt() does the entire find-load-authenticate >> dance which is required with secure assets, but it's obviously >> possible to rip out the find-load part of that and go on from >> there. > > Yes, I think we should load all firmware early. ZAP shader is a bit > unique since the DT can override the name, but it might be nice to > check for its presence earlier. > > At the same time it probably should not stop us from fixing the idle() > vs suspend() bug. I'm open to both solutions, as long as it can unblock me from resubmitting the (hopefully) final version of GMU wrapper! Konrad > >> >> Do you think that would be a better solution? >> >> Konrad >> >>> >>>> Test cases: >>>> 1. firmware baked into kernel >>>> 2. error loading fw in initrd -> load from rootfs at DE start >>>> >>>> Both succeed on A619_holi (SM6375) and A630 (SDM845). >>>> >>>> Fixes: 0d997f95b70f ("drm/msm/adreno: fix runtime PM imbalance at gpu load") >>>> Signed-off-by: Konrad Dybcio <konrad.dybcio@xxxxxxxxxx> >>>> --- >>>> drivers/gpu/drm/msm/adreno/adreno_device.c | 2 +- >>>> 1 file changed, 1 insertion(+), 1 deletion(-) >>>> >>>> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c >>>> index f61896629be6..59f3302e8167 100644 >>>> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c >>>> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c >>>> @@ -477,7 +477,7 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev) >>>> return gpu; >>>> >>>> err_put_rpm: >>>> - pm_runtime_put_sync(&pdev->dev); >>>> + pm_runtime_put_sync_suspend(&pdev->dev); >>>> err_disable_rpm: >>>> pm_runtime_disable(&pdev->dev); >>> >>> Johan > > >