Re: [PATCH] drm/amdgpu: move buffer funcs setting up a level

Michel Dänzer <michel.daenzer@xxxxxxxxxxx> · Tue, 7 Nov 2023 17:46:28 +0100

On 11/7/23 15:47, Alex Deucher wrote:
> On Tue, Nov 7, 2023 at 9:19 AM Alex Deucher <alexdeucher@xxxxxxxxx> wrote:
>> On Tue, Nov 7, 2023 at 5:52 AM Christian König
>> <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
>>> On 03.11.23 at 23:10, Alex Deucher wrote:
>>>> On Fri, Nov 3, 2023 at 4:17 PM Alex Deucher <alexdeucher@xxxxxxxxx> wrote:
>>>>> On Thu, Oct 26, 2023 at 4:17 PM Luben Tuikov <ltuikov89@xxxxxxxxx> wrote:
>>>>>> Pushed to drm-misc-next.
>>>>> BTW, I'm seeing the following on older GPUs with VCE and UVD even with
>>>>> this patch:
>>>>> [   11.886024] amdgpu 0000:0a:00.0: [drm] *ERROR* drm_sched_job_init: entity has no rq!
>>>>> [   11.886028] amdgpu 0000:0a:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on uvd (-2).
>>>>> [   11.889927] amdgpu 0000:0a:00.0: [drm] *ERROR* drm_sched_job_init: entity has no rq!
>>>>> [   11.889930] amdgpu 0000:0a:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vce0 (-2).
>>>>> [   11.890172] [drm:process_one_work] *ERROR* ib ring test failed (-2).
>>>>> Seems to be specific to UVD and VCE, I don't see anything similar with
>>>>> VCN, but the flows for both are pretty similar.  Not sure why we are
>>>>> not seeing it for VCN.  Just a heads up if you have any ideas.  Will
>>>>> take a closer look next week.
>>>> + Leo
>>>>
>>>> I found the problem.  We set up scheduling entities for UVD and VCE
>>>> specifically and not for any other engines.  I don't remember why
>>>> offhand.  I'm guessing maybe to deal with the session limits on UVD
>>>> and VCE?  If so I'm not sure of a clean way to fix this.
>>>
>>> I haven't looked through all my mails yet so could be that Leo has
>>> already answered this.
>>>
>>> The UVD/VCE entities are used for the older chips where applications
>>> have to use create/destroy messages to the firmware.
>>>
>>> If an application exits without cleaning up its handles the kernel
>>> sends the appropriate destroy messages itself. For an example see
>>> amdgpu_uvd_free_handles().
>>>
>>> We used to initialize those entities with separate calls after the
>>> scheduler had been brought up, see amdgpu_uvd_entity_init() for an example.
>>>
>>> But this was somehow messed up and we now do the call to
>>> amdgpu_uvd_entity_init() at the end of *_sw_init() instead of _late_init().
>>>
>>> I suggest just coming up with a function that can be used for the
>>> late_init() callback of the UVD/VCE blocks.
>>
>> I guess the issue is that we only need to initialize the entity once
>> so sw_init makes sense.  All of the other functions get called at
>> resume time, etc.  I think we could probably put it into
>> amdgpu_device_init_schedulers() somehow.
> 
> I think something like this might do the trick.

This does indeed fix the IB test failures for me with Bonaire.

There are still

[drm] Fence fallback timer expired on ring sdma0

messages; that might be a separate regression, though.
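
For anyone following along, here is a minimal userspace sketch of the
ordering problem described in the quoted thread: an entity that is set up
before its scheduler exists ends up with no run queue, and job init then
fails with -ENOENT (which is -2 on Linux, matching the log). Every name
below (toy_sched, toy_entity, toy_*) is hypothetical and only models the
ordering; none of it is the real drm_sched or amdgpu code, and it is not
the patch under discussion.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; NOT the real drm_sched or amdgpu structures. */
struct toy_sched {
	bool ready;            /* true once the scheduler has been brought up */
};

struct toy_entity {
	struct toy_sched *rq;  /* NULL models "entity has no rq" */
};

/* Model of scheduler bring-up. */
static void toy_sched_init(struct toy_sched *s)
{
	s->ready = true;
}

/* Bind an entity to a scheduler; it only gets a run queue if the
 * scheduler is already up (assumption of this toy model). */
static int toy_entity_init(struct toy_entity *e, struct toy_sched *s)
{
	e->rq = s->ready ? s : NULL;
	return e->rq ? 0 : -ENOENT;
}

/* Job init needs a run queue, otherwise it fails with -ENOENT. */
static int toy_job_init(const struct toy_entity *e)
{
	return e->rq ? 0 : -ENOENT;
}

/* Failing order: entity set up (sw_init time) before the scheduler. */
static int toy_ib_test_entity_first(void)
{
	struct toy_sched s = { .ready = false };
	struct toy_entity e = { .rq = NULL };

	(void)toy_entity_init(&e, &s); /* too early, entity gets no rq */
	toy_sched_init(&s);
	return toy_job_init(&e);
}

/* Working order: scheduler brought up first, then the entity. */
static int toy_ib_test_sched_first(void)
{
	struct toy_sched s = { .ready = false };
	struct toy_entity e = { .rq = NULL };

	toy_sched_init(&s);
	(void)toy_entity_init(&e, &s);
	return toy_job_init(&e);
}
```

In this model toy_ib_test_entity_first() returns -ENOENT and
toy_ib_test_sched_first() returns 0, which is why moving the entity setup
to a point after scheduler init makes the IB tests pass.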

-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer