Re: [DMC_BUGFIX_SKL_V2 1/5] drm/i915/skl: Added a check for the hardware status of csr fw before loading.

Animesh Manna <animesh.manna@xxxxxxxxx> · Thu, 17 Sep 2015 00:53:21 +0530

On 9/14/2015 1:16 PM, Daniel Vetter wrote:
On Fri, Sep 11, 2015 at 12:36:24AM +0530, Animesh Manna wrote:

On 9/10/2015 8:15 PM, Daniel Vetter wrote:
On Thu, Sep 10, 2015 at 01:58:54AM +0530, Animesh Manna wrote:
On 9/2/2015 2:24 PM, Daniel Vetter wrote:
On Wed, Aug 26, 2015 at 07:40:54PM +0530, Animesh Manna wrote:
On 8/26/2015 6:40 PM, Daniel Vetter wrote:
On Wed, Aug 26, 2015 at 01:36:05AM +0530, Animesh Manna wrote:
Dmc will restore the csr program except DC9, cold boot,
warm reset, PCI function level reset, and hibernate/suspend.

intel_csr_load_program() function is used to load the firmware
data from kernel memory to csr address space.

All values of csr address space will be zero if it got reset and
the first byte of csr program is always a non-zero if firmware
is loaded successfuly. Based on hardware status will load the
firmware.

Without this condition check if we overwrite the firmware data the
counters exposed for dc5/dc6 (help for debugging) will be nullified.
Bacause of the above reason mentioned just above we need to block firmware loading again.
So only WARN_ON will not help.


v1: Initial version.

v2: Based on review comments from Daniel,
- Added a check to know hardware status and load the firmware if not loaded.

Cc: Daniel Vetter <daniel.vetter@xxxxxxxxx>
Cc: Damien Lespiau <damien.lespiau@xxxxxxxxx>
Cc: Imre Deak <imre.deak@xxxxxxxxx>
Cc: Sunil Kamath <sunil.kamath@xxxxxxxxx>
Signed-off-by: Animesh Manna <animesh.manna@xxxxxxxxx>
Signed-off-by: Vathsala Nagaraju <vathsala.nagaraju@xxxxxxxxx>
---
  drivers/gpu/drm/i915/intel_csr.c | 9 +++++++++
  1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_csr.c b/drivers/gpu/drm/i915/intel_csr.c
index ba1ae03..682cc26 100644
--- a/drivers/gpu/drm/i915/intel_csr.c
+++ b/drivers/gpu/drm/i915/intel_csr.c
@@ -252,6 +252,15 @@ void intel_csr_load_program(struct drm_device *dev)
  		return;
  	}
+	/*
+	 * Dmc will restore the csr the program except DC9, cold boot,
+	 * warm reset, PCI function level reset, and hibernate/suspend.
+	 * This condition will help to check if csr address space is reset/
+	 * not loaded.
+	 */
Atm we call this from driver load and resume, which doesn seem to cover
all the cases you mention in the comment. Should this be a WARN_ON
instead? Or do we have troubles in our init sequence where we load too
many times?
Yes, the above statement taken from bspec to describe about the special cases dmc will not restore the firmware.
Agree, In our cases cold boot and hibernate/suspend mainly we need to load the firmware again, so in my
second sentence I wanted to comment mainly regarding this condition check added for suspend-hibernate(reset)
and cold boot(not loaded).

Anyways the same api later can be used to load the firmware from anywhere, so my intention to check firmware loaded or not.
If already loaded then not to overwrite the csr address space to maintain the dc5/dc6 counter value.

Can the below comment more clear to you.

	/*
	 * Dmc will restore the csr the program except DC9, cold boot,
	 * warm reset, PCI function level reset, and hibernate/suspend.
	 * If firmware is restored by dmc then no need to load again which
	 * will keep the dc5/dc6 counter exposed by firmware.
	 */

No issue in init sequence.
That seems to still cover all the callers of the function afaics - we do
pci resets over suspend resume unconditionally. So I still don't
understand where exactly we try to load the dmc firmware in i915.ko when
it's already loaded.
During resume intel_csr_load_program() will be called from
intel_runtime_resume().

intel_runtime_resume()-> skl_resume_prepare()-> intel_csr_load_program()

During Pc10 entry testing I can see dmc is restoring back the firmware always,
but as you mentioned pci-reset can happen unconditionally, but still then
also during resume intel_runtime_resume() will be called and based on
register read of csr-base-address firmware loading will happen.
But in your comment you're saying it won't get restored in case of dc9 and
suspend. So that seems to mismatch what you're saying here (and what the
commit message says) and what the code does. And this function here is
called for resume after suspend/hibernate only.
pc10 entry explanation I told is for skylake. dc9 in skylake is not possible.
I think you are confusing between dc6 and dc9. Pc10 can be achieved by
entering into dc6 (not dc9) for skylake. dc9 is the lowest possible state
for broxton which is not present for skylake.
I have no idea at all about different pc levels on skl. What I'm talking
about is system suspend/resume and driver load, which are the places this
function gets called. At least afaics.

Here intel_csr_load_program() will be used for both skylake and broxton, and instruction
execution flow will be different in case of suspend/resume which I think is confusing
you.
That seems like really important information. What's different on bxt?
These are the kind of details you should explain in the commit message ...

I am ready explain you in detail. It will be good if we discuss specific use-case scenario
and itz software design for specific platform. Another point - as dmc related code for
broxton is not merged better first we close design for skylake. Now, I have added dc9
description in comment thinking of future. If you want I can remove for now and later
can add in bxt patch series for enabling dmc. Will wait for your reply.
This question here isn't about the overall design and how to handle power
wells in skl/bxt. That's a separate discussion and tracked somewhere else.
I'm really just confused about when exactly we need to reload to firmware,
and why we need a runtime check for that. Normally we should know when to
reload the firmware and just either reload or not, without checking hw
state. And I don't like checking for hw state since at least in the past
that kind of code ended up being fragile - it's an illusion that it does
the right thing no matter what, since often there's other tricky ordering
constraints. And if you have automatic duct-tape like then no one will
ever spot those other, harder to spot issues, until an expensive customer
escalation happens.

So what I want to know here is:
- When exactly do we need to reload dmc firmware.
In skl, during driver load first time we load the firmware, during 
normal suspend-resume (dc6 entry/exit)
no need to reload the firmware again as dmc will take care of it. But 
during suspend/hibernation
dmc will not restore the firmware. In that case driver need to reload it 
again. I do not know
how to differentiate pm-suspend and suspend-hibernation and thought both 
the cases
intel_runtime_resume() will be called where we can check the h/w state 
and reload the
firmware if dmc is not restored.

In bxt, during driver load first time we load the firmware, during 
normal suspend-resume
display engine will enter into dc9 and dmc will not restore the 
firmware. So every
suspend-resume we need to reload the firmware.
- What exactly is the reason why we can't make that decision statically in
   the code (by calling csr_load at the right spots).
As I mentioned before in case of skylake can we differentiate between
"resume from pm-suspend" with "resume from suspend-hibernation" inside 
driver?

In case of broxton, every time we need to reload, so we can decide 
statically.

-Animesh

Thanks, Daniel

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx