Re: [PATCH 3/4] drm/radeon: consolidate uvd/vce initialization, resume and suspend.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Am 16.03.2016 um 18:43 schrieb Jerome Glisse:
On Wed, Mar 16, 2016 at 06:06:36PM +0100, Christian König wrote:
[SNIP]
And as I said that is exactly what you should NOT be doing here. Once the
firmware is loaded the block should be kept in that state.

Freeing the memory allocated for the firmware is also not a good idea at all
because we don't know who exactly is accessing it.
And what i am saying is that block is not responding so it does not matter
what memory endup there as it only try to read from the firmware afaict.
So nothing bad will come from that, block is already in non functional
state what worse can happen ?

That the block is not responding to CPU register accesses just means that the SRBM read side interface doesn't work for some reason.

Depending on what the cause is the micro engine usually still happily read and write memory when that happens.

When the whole engine is hung you usually don't get working power management as well.

[SNIP]

Again lot simpler to follow control flow than to jump through various level
of if/else. Again uvd and vce tied together (and again i can untie them if
you think it is better to untie them).

But this time the extra thing is that i properly disable ring if any error
happens while existing code does not.
And again that is exactly what we should NOT do.

When initialization fails we don't know in which state the ring buffer and
micro engines are, so freeing them and giving the space back to be reused in
clearly not a good idea.

All we should do is clearing the ready flag when something fails to prevent
userspace from making command submissions to the failed engine.
This block only read from that memory ! Never write (unless it modify the fw
in place which would surprise me). So when block is in bad states it does not
matter what endup there, block is not processing anything.

No, the memory is quite heavily written as well. Only BAR0 is code, BAR1 is the heap and BAR2 the stack of the micro engine.

Those are usually heavily accessed by the interrupt handlers even when the engine is completely stuck otherwise.

The only way I know to prevent that is to stall the both the register as well as the memory bus, wait for a moment and then put the micro engine into soft reset. See uvd_v1_0_stop() how that is done.

But as I said after that you can usually forget power management as well.

I am not pasting the init path but it the same logic, tying uvd and vce
together and simplifying error code path.



Please also note that VCE/UVD has dependencies on power management, so that
when they are once initialized they should NOT be turned off again.

I only briefly skimmed over your patch, but it actually looks like to me
that you broken that by trying to cleanup the initialization routine.
I have seen that but assuming Heisenbergs does not get involve, then given
that the block is not responding to register write it is unlikely that thing
will we worse if we try to disable the block. And from my testing it does
not impact power management. My guess is that the block keep reporting it is
busy and that power gating and clock gating are inhibited by that.
It's more complicated than that just a simple busy signal. The engines
actively communicate with the power management controller to tell them their
needs and limits for the clocks based on the workload they have.

Once initialized the power management controller expects the UVD and VCE
micro-controllers to answer such requests.

Failing to do so can get you stuck at a specific power level.
Yes and what i am saying is that block is in bad states so either it is
answering always busy and thus stop power level change or it is not answering
and power level can change. So trying to disable the block leads to the
exact same situation, either the block shutdown and power level can change
or it does not and thing are exactly as if we did nothing !

I've tested that multiple times and as long as you don't turn of the VCPU completely it's usually still answering the interrupt handler even when it is stuck in an endless loop otherwise and not doing anything else.

That's also the reason why it is so dangerous to just put the block into reset in this state or touch it otherwise.

You can't stall it because the SRBM interface is borked and you don't have access to the registers any more and you can't properly reset it either because when you do that in the middle of a memory transaction the whole system goes down.

[SNIP]
As far as I can see you're actually messing the error handling up quite a
bit here instead of improving it.
I am not ! I AM MAKING IT CLEAR WHAT HAPPENS. YES ON ERROR I TRY TO DISABLE
THING BUT IF BLOCK IS NOT ANSWERING THIS CAN NOT MAKE THING WORSE.

Calm down, I'm just trying to explain here how the hardware works and why your approach can cause trouble as well.

I'm perfectly fine with the cleanups on the control flow and the coding style as long as we can properly test them on at least some hardware generations.

So please describe in detail what the problems you are seeing and why
disabling both UVD and VCE helps with them.
I said it above already, on second part when we fails uvd/vce we do not
disable ring and things goes from bad to worse.

A kernel log from a failed suspend/resume cycle would help quite a bit here.
[   62.111042] [drm] ring test on 5 succeeded in 2 usecs
[   62.111046] [drm] UVD initialized successfully.
[   62.111080] [drm] ib test on ring 0 succeeded in 0 usecs
[   62.111110] [drm] ib test on ring 1 succeeded in 0 usecs
[   62.111137] [drm] ib test on ring 2 succeeded in 0 usecs
[   62.111198] [drm] ib test on ring 3 succeeded in 0 usecs
[   62.111257] [drm] ib test on ring 4 succeeded in 0 usecs
[   62.309312] usb 1-7: reset high-speed USB device number 3 using xhci_hcd
[   62.362819] ACPI: \_SB_.PCI0.LPCB.SIO_.COM1: ACPI_NOTIFY_BUS_CHECK event
[   62.362824] ACPI: \_SB_.PCI0.RP05: ACPI_NOTIFY_BUS_CHECK event
[   62.362825] ACPI: \_SB_.PCI0.EHC1: ACPI_NOTIFY_BUS_CHECK event
[   62.362827] ACPI: \_SB_.PCI0.EHC2: ACPI_NOTIFY_BUS_CHECK event
[   62.626546] usb 1-12: reset full-speed USB device number 4 using xhci_hcd
[   62.793373] parport_pc 00:08: activated
[   62.858604] tpm_tis 00:0c: TPM is disabled/deactivated (0x7)
[   63.625310] psmouse serio2: alps: Unknown ALPS touchpad: E7=10 00 64, EC=10 00 64
[   64.826916] psmouse serio3: synaptics: queried max coordinates: x [..5660], y [..4730]
[   64.858652] psmouse serio3: synaptics: queried min coordinates: x [1324..], y [1248..]
[   72.288487] radeon 0000:01:00.0: ring 5 stalled for more than 10020msec
[   72.288489] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000000002 last fence id 0x0000000000000004 on ring 5)
[   72.288577] [drm:uvd_v1_0_ib_test [radeon]] *ERROR* radeon: fence wait failed (-35).
[   72.288594] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on ring 5 (-35).
[   72.473034] [drm:si_dpm_set_power_state [radeon]] *ERROR* si_set_sw_state failed
[   73.452173] [drm:radeon_dp_link_train [radeon]] *ERROR* displayport link status failed
[   73.452197] [drm:radeon_dp_link_train [radeon]] *ERROR* clock recovery failed
[   74.296728] [drm:radeon_dp_link_train [radeon]] *ERROR* displayport link status failed
[   74.296744] [drm:radeon_dp_link_train [radeon]] *ERROR* clock recovery failed


So first par of uvd succeed but vce_v1_0_resume() fails and after that uvd fails too
(vce too). I doubt you can fix that until you can fix the firmware signature check
issue. What happens is that on resume from hibernation the first kernel load radeon
and initialize the GPU (this work properly and VCE block is initialized properly).

I'm not sure if that is actually an problem with the firmware signature check, but let's focus on the issue at hand first.

But then it resume the hibernated kernel (kexec) and the radeon resume code path by
reprogramming the memory configuration of the GPU make the fw unaccessible or at
different address compare to first boot kernel and thing from that point the vce
block is in bad state (impacting the uvd block afaict). This trickle down to other
block like displayport and you endup with no display.

Think about it, that's exactly as I explained above.

The VCPU (UVD) and ECPU (VCE) keeps running after the first boot cycle, even when they are not accessible from the CPU any more for some reason.

Because of this the power management can still switch clocks and turn blocks on and off. etc... and that's why the memory clock can be raised to match your resolution on the screen and DP link training still works.

Now on the second cycle we somehow totally crash them and because of that the power management can't raise the clocks any more:
[   72.473034] [drm:si_dpm_set_power_state [radeon]] *ERROR* si_set_sw_state failed

That results in not enough memory bandwidth or a PLL being powered down and so DP link training fails as well.

As a band aid I suggest to just set has_uvd to false when either of the vce resume or the uvd resume fails.

Also feel free to cleanup the control flow on coding style as long as you don't try to free the memory for the micro engines or try to touch the hardware blocks when something fails.

Regards,
Christian.
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel




[Index of Archives]     [Linux DRI Users]     [Linux Intel Graphics]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [XFree86]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux