Re: [PATCH 3/4] drm/radeon: consolidate uvd/vce initialization, resume and suspend.

Jerome Glisse <jglisse@xxxxxxxxxx> · Wed, 16 Mar 2016 18:43:49 +0100

On Wed, Mar 16, 2016 at 06:06:36PM +0100, Christian König wrote:
> Am 16.03.2016 um 16:56 schrieb Jerome Glisse:
> >On Wed, Mar 16, 2016 at 04:19:16PM +0100, Christian König wrote:
> >>Am 16.03.2016 um 15:59 schrieb Jerome Glisse:
> >>>On Wed, Mar 16, 2016 at 2:03 PM, Christian König
> >>><deathsimple@xxxxxxxxxxx> wrote:
> >>>>Am 16.03.2016 um 13:48 schrieb Jérôme Glisse:
> >>>>>From: Jérome Glisse <jglisse@xxxxxxxxxx>
> >>>>>
> >>>>>This consolidate uvd/vce into a common shape for all generation. It
> >>>>>also leverage the rdev->has_uvd flags to know what it is useless to
> >>>>>try to resume/suspend uvd/vce block.
> >>>>>
> >>>>>There is no functional changes when there is no error. On error the
> >>>>>device driver will behave differently than before after this patch.
> >>>>>It should now safely ignore uvd/vce errors and keeps normal operation
> >>>>>of others engine. This is an improvement over current situation where
> >>>>>we have different behavior depending on GPU generation and on what
> >>>>>fails.
> >>>>>
> >>>>>Finaly this is a preparatory step for a patch which allow to disable
> >>>>>uvd/vce as a driver option.
> >>>>>
> >>>>>This have only been tested on southern island so please test it on
> >>>>>other generations (i do not have hardware handy for now).
> >>>>>
> >>>>>Signed-off-by: Jérôme Glisse <jglisse@xxxxxxxxxx>
> >>>>>Cc: Alex Deucher <alexander.deucher@xxxxxxx>
> >>>>>Cc: Christian König <christian.koenig@xxxxxxx>
> >>>>NAK, skipping UVD and VCE suspend/resume when initialization fails should
> >>>>already be implemented.
> >>>>
> >>>>There might be some (quite some) bugs in there, but that doesn't justify
> >>>>reworking the initialization over all different generations. Especially
> >>>>since you don't have hardware to test all of them.
> >>>>
> >>>>Just make sure that radeon_uvd/vce_fini() is called when something goes
> >>>>wrong and/or that the UVD/VCE BO is properly released.
> >>>>
> >>>>Regards,
> >>>>Christian.
> >>>Current code is a mess when it comes to handling error related to
> >>>uvd/vce. This patch consolidate control flow in something easy to
> >>>follow. You can check that there is absolulety no control flow change
> >>>for the case where uvd/vce works and thus it does not break anything
> >>>that works. It will only gracefully fails and cleanup if things go
> >>>wrong. So while i have not tested on other hw i am confident that this
> >>>does not introduce regression.
> >>>
> >>>I tried to do it without consolidation but it ended up in adding even
> >>>more if() levels that line did begins after 80colums. So please
> >>>reconsider because this is an improvement over existing code.
> >>Well then please point out at the example of the SI or CIK code what exactly
> >>is missing here.
> >Going from :
> >	if (rdev->has_uvd) {
> >		r = uvd_v2_2_resume(rdev);
> >		if (!r) {
> >			r = radeon_fence_driver_start_ring(rdev,
> >							   R600_RING_TYPE_UVD_INDEX);
> >			if (r)
> >				dev_err(rdev->dev, "UVD fences init error (%d).\n", r);
> >		}
> >		if (r)
> >			rdev->ring[R600_RING_TYPE_UVD_INDEX].ring_size = 0;
> >	}
> >	r = radeon_vce_resume(rdev);
> >	if (!r) {
> >		r = vce_v1_0_resume(rdev);
> >		if (!r)
> >			r = radeon_fence_driver_start_ring(rdev,
> >							   TN_RING_TYPE_VCE1_INDEX);
> >		if (!r)
> >			r = radeon_fence_driver_start_ring(rdev,
> >							   TN_RING_TYPE_VCE2_INDEX);
> >	}
> >	if (r) {
> >		dev_err(rdev->dev, "VCE init error (%d).\n", r);
> >		rdev->ring[TN_RING_TYPE_VCE1_INDEX].ring_size = 0;
> >		rdev->ring[TN_RING_TYPE_VCE2_INDEX].ring_size = 0;
> >	}
> >
> >
> >To:
> >	r = uvd_v2_2_resume(rdev);
> >	if (r)
> >		goto error;
> >	r = radeon_fence_driver_start_ring(rdev, R600_RING_TYPE_UVD_INDEX);
> >	if (r)
> >		goto error_uvd;
> >	r = radeon_vce_resume(rdev);
> >	if (r)
> >		goto error_uvd;
> >	r = vce_v1_0_resume(rdev);
> >	if (r)
> >		goto error_vce;
> >	r = radeon_fence_driver_start_ring(rdev, TN_RING_TYPE_VCE1_INDEX);
> >	if (r)
> >		goto error_vce;
> >	r = radeon_fence_driver_start_ring(rdev, TN_RING_TYPE_VCE2_INDEX);
> >	if (r)
> >		goto error_vce;
> >	return;
> >error_vce:
> >	radeon_vce_suspend(rdev);
> >error_uvd:
> >	radeon_uvd_suspend(rdev);
> >error:
> >	dev_err(rdev->dev, "UVD/VCE startup error (%d).\n", r);
> >	/* On error just disable everything. */
> >	radeon_vce_fini(rdev);
> >	radeon_uvd_fini(rdev);
> >	rdev->ring[R600_RING_TYPE_UVD_INDEX].ring_size = 0;
> >	rdev->ring[TN_RING_TYPE_VCE1_INDEX].ring_size = 0;
> >	rdev->ring[TN_RING_TYPE_VCE2_INDEX].ring_size = 0;
> 
> And as I said that is exactly what you should NOT be doing here. Once the
> firmware is loaded the block should be kept in that state.
> 
> Freeing the memory allocated for the firmware is also not a good idea at all
> because we don't know who exactly is accessing it.

And what i am saying is that block is not responding so it does not matter
what memory endup there as it only try to read from the firmware afaict.
So nothing bad will come from that, block is already in non functional
state what worse can happen ?

> >Is lot more clear to me than bunch of intertwine if/else. A clear error path
> >for which you do not have to jump through if level to see what get executed
> >or not on error. The only difference is that it does tie uvd and vce together.
> >I did that on purpose because on the hw i am playing with the vce seems to be
> >useless when the uvd block fails (opposite seems to be true too). If you think
> >we should still try to init vce when uvd fails or uvd when vce fails i can
> >split uvd and vce.
> 
> UVD and VCE are two completely separate blocks, they shouldn't be related to
> each other in anyway.
> 
> When you see failures of both at the same time it's rather unlikely that it
> is actually related to them.

Fine i will split then.

> >The other difference with existing code is that i free resources normaly use
> >uvd/vce on error (free fw buffer). This is just me trying to free resource
> >early and it has no impact as block are not working.
> >
> >-----------------------------------------------------------------------------------
> >
> >Second part we go from:
> >	if (rdev->has_uvd) {
> >		ring = &rdev->ring[R600_RING_TYPE_UVD_INDEX];
> >		if (ring->ring_size) {
> >			r = radeon_ring_init(rdev, ring, ring->ring_size, 0,
> >					     RADEON_CP_PACKET2);
> >			if (!r)
> >				r = uvd_v1_0_init(rdev);
> >			if (r)
> >				DRM_ERROR("radeon: failed initializing UVD (%d).\n", r);
> >		}
> >	}
> >	r = -ENOENT;
> >	ring = &rdev->ring[TN_RING_TYPE_VCE1_INDEX];
> >	if (ring->ring_size)
> >		r = radeon_ring_init(rdev, ring, ring->ring_size, 0,
> >				     VCE_CMD_NO_OP);
> >	ring = &rdev->ring[TN_RING_TYPE_VCE2_INDEX];
> >	if (ring->ring_size)
> >		r = radeon_ring_init(rdev, ring, ring->ring_size, 0,
> >				     VCE_CMD_NO_OP);
> >	if (!r)
> >		r = vce_v1_0_init(rdev);
> >	else if (r != -ENOENT)
> >		DRM_ERROR("radeon: failed initializing VCE (%d).\n", r);
> >
> >
> >To:
> >	ring = &rdev->ring[R600_RING_TYPE_UVD_INDEX];
> >	r = radeon_ring_init(rdev, ring, ring->ring_size, 0, RADEON_CP_PACKET2);
> >	if (r)
> >		goto error;
> >	r = uvd_v1_0_init(rdev);
> >	if (r)
> >		goto error_uvd;
> >	ring = &rdev->ring[TN_RING_TYPE_VCE1_INDEX];
> >	r = radeon_ring_init(rdev, ring, ring->ring_size, 0, VCE_CMD_NO_OP);
> >	if (r)
> >		goto error_vce1;
> >	ring = &rdev->ring[TN_RING_TYPE_VCE2_INDEX];
> >	r = radeon_ring_init(rdev, ring, ring->ring_size, 0, VCE_CMD_NO_OP);
> >	if (r)
> >		goto error_vce2;
> >	r = vce_v1_0_init(rdev);
> >	if (r)
> >		goto error_vce;
> >	return;
> >error_vce:
> >	radeon_ring_fini(rdev, &rdev->ring[TN_RING_TYPE_VCE2_INDEX]);
> >error_vce2:
> >	radeon_ring_fini(rdev, &rdev->ring[TN_RING_TYPE_VCE1_INDEX]);
> >error_vce1:
> >	uvd_v1_0_fini(rdev);
> >error_uvd:
> >	radeon_ring_fini(rdev, &rdev->ring[R600_RING_TYPE_UVD_INDEX]);
> >error:
> >	dev_err(rdev->dev, "UVD/VCE resume error (%d).\n", r);
> >	/* On error just disable everything. */
> >	radeon_uvd_suspend(rdev);
> >	radeon_vce_suspend(rdev);
> >	radeon_uvd_fini(rdev);
> >	radeon_vce_fini(rdev);
> >	rdev->ring[R600_RING_TYPE_UVD_INDEX].ring_size = 0;
> >	rdev->ring[TN_RING_TYPE_VCE1_INDEX].ring_size = 0;
> >	rdev->ring[TN_RING_TYPE_VCE2_INDEX].ring_size = 0;
> >
> >Again lot simpler to follow control flow than to jump through various level
> >of if/else. Again uvd and vce tied together (and again i can untie them if
> >you think it is better to untie them).
> >
> >But this time the extra thing is that i properly disable ring if any error
> >happens while existing code does not.
> 
> And again that is exactly what we should NOT do.
> 
> When initialization fails we don't know in which state the ring buffer and
> micro engines are, so freeing them and giving the space back to be reused in
> clearly not a good idea.
> 
> All we should do is clearing the ready flag when something fails to prevent
> userspace from making command submissions to the failed engine.

This block only read from that memory ! Never write (unless it modify the fw
in place which would surprise me). So when block is in bad states it does not
matter what endup there, block is not processing anything.

> >I am not pasting the init path but it the same logic, tying uvd and vce
> >together and simplifying error code path.
> >
> >
> >
> >>Please also note that VCE/UVD has dependencies on power management, so that
> >>when they are once initialized they should NOT be turned off again.
> >>
> >>I only briefly skimmed over your patch, but it actually looks like to me
> >>that you broken that by trying to cleanup the initialization routine.
> >I have seen that but assuming Heisenbergs does not get involve, then given
> >that the block is not responding to register write it is unlikely that thing
> >will we worse if we try to disable the block. And from my testing it does
> >not impact power management. My guess is that the block keep reporting it is
> >busy and that power gating and clock gating are inhibited by that.
> 
> It's more complicated than that just a simple busy signal. The engines
> actively communicate with the power management controller to tell them their
> needs and limits for the clocks based on the workload they have.
> 
> Once initialized the power management controller expects the UVD and VCE
> micro-controllers to answer such requests.
> 
> Failing to do so can get you stuck at a specific power level.

Yes and what i am saying is that block is in bad states so either it is
answering always busy and thus stop power level change or it is not answering
and power level can change. So trying to disable the block leads to the
exact same situation, either the block shutdown and power level can change
or it does not and thing are exactly as if we did nothing !

> >The other thing i am doing over existing code is freeing memory for the fw
> >buffer. I do not think it is a big deal. I am doing that because then i just
> >flag the uvd has dead (rdev->has_uvd = 0) and avoid to try to restore it
> >for next suspend/resume cycle or hibernation cycle.
> >
> >So again the only thing i am change is the case where thing does not work.
> >With that patch i can actualy hibernate laptop and get back a working desktop
> >module video decoding/encoding no longer working. I call that an improvement.
> 
> It's nice that it works for you now, but my laptop is working fine with UVD
> and VCE as well and I would like to keep it that way.

And my patch will not break that, test it if you do not believe me !

> As far as I can see you're actually messing the error handling up quite a
> bit here instead of improving it.

I am not ! I AM MAKING IT CLEAR WHAT HAPPENS. YES ON ERROR I TRY TO DISABLE
THING BUT IF BLOCK IS NOT ANSWERING THIS CAN NOT MAKE THING WORSE.

> So please describe in detail what the problems you are seeing and why
> disabling both UVD and VCE helps with them.

I said it above already, on second part when we fails uvd/vce we do not
disable ring and things goes from bad to worse.

> A kernel log from a failed suspend/resume cycle would help quite a bit here.

[   62.111042] [drm] ring test on 5 succeeded in 2 usecs
[   62.111046] [drm] UVD initialized successfully.
[   62.111080] [drm] ib test on ring 0 succeeded in 0 usecs
[   62.111110] [drm] ib test on ring 1 succeeded in 0 usecs
[   62.111137] [drm] ib test on ring 2 succeeded in 0 usecs
[   62.111198] [drm] ib test on ring 3 succeeded in 0 usecs
[   62.111257] [drm] ib test on ring 4 succeeded in 0 usecs
[   62.309312] usb 1-7: reset high-speed USB device number 3 using xhci_hcd
[   62.362819] ACPI: \_SB_.PCI0.LPCB.SIO_.COM1: ACPI_NOTIFY_BUS_CHECK event
[   62.362824] ACPI: \_SB_.PCI0.RP05: ACPI_NOTIFY_BUS_CHECK event
[   62.362825] ACPI: \_SB_.PCI0.EHC1: ACPI_NOTIFY_BUS_CHECK event
[   62.362827] ACPI: \_SB_.PCI0.EHC2: ACPI_NOTIFY_BUS_CHECK event
[   62.626546] usb 1-12: reset full-speed USB device number 4 using xhci_hcd
[   62.793373] parport_pc 00:08: activated
[   62.858604] tpm_tis 00:0c: TPM is disabled/deactivated (0x7)
[   63.625310] psmouse serio2: alps: Unknown ALPS touchpad: E7=10 00 64, EC=10 00 64
[   64.826916] psmouse serio3: synaptics: queried max coordinates: x [..5660], y [..4730]
[   64.858652] psmouse serio3: synaptics: queried min coordinates: x [1324..], y [1248..]
[   72.288487] radeon 0000:01:00.0: ring 5 stalled for more than 10020msec
[   72.288489] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000000002 last fence id 0x0000000000000004 on ring 5)
[   72.288577] [drm:uvd_v1_0_ib_test [radeon]] *ERROR* radeon: fence wait failed (-35).
[   72.288594] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on ring 5 (-35).
[   72.473034] [drm:si_dpm_set_power_state [radeon]] *ERROR* si_set_sw_state failed
[   73.452173] [drm:radeon_dp_link_train [radeon]] *ERROR* displayport link status failed
[   73.452197] [drm:radeon_dp_link_train [radeon]] *ERROR* clock recovery failed
[   74.296728] [drm:radeon_dp_link_train [radeon]] *ERROR* displayport link status failed
[   74.296744] [drm:radeon_dp_link_train [radeon]] *ERROR* clock recovery failed

So first par of uvd succeed but vce_v1_0_resume() fails and after that uvd fails too
(vce too). I doubt you can fix that until you can fix the firmware signature check
issue. What happens is that on resume from hibernation the first kernel load radeon
and initialize the GPU (this work properly and VCE block is initialized properly).

But then it resume the hibernated kernel (kexec) and the radeon resume code path by
reprogramming the memory configuration of the GPU make the fw unaccessible or at
different address compare to first boot kernel and thing from that point the vce
block is in bad state (impacting the uvd block afaict). This trickle down to other
block like displayport and you endup with no display.

Jérôme
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel