On 07/01/2019 12:16, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2019-01-07 11:58:13)
Hi,
This series has not been recognized by Patchwork as such, nor are the
patches numbered. Have you used git format-patch -<N> --cover-letter and
git send-email to send it out?
Rest inline.
On 05/01/2019 02:39, Carlos Santa wrote:
+static void gen8_watchdog_irq_handler(unsigned long data)
+{
+ struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
+ struct drm_i915_private *dev_priv = engine->i915;
+ enum forcewake_domains fw_domains;
+ u32 current_seqno;
+
+ switch (engine->id) {
+ default:
+ MISSING_CASE(engine->id);
+ /* fall through */
+ case RCS:
+ fw_domains = FORCEWAKE_RENDER;
+ break;
+ case VCS:
+ case VCS2:
+ case VECS:
+ fw_domains = FORCEWAKE_MEDIA;
+ break;
+ }
+
+ intel_uncore_forcewake_get(dev_priv, fw_domains);
I'd be tempted to drop this and just use I915_WRITE. It doesn't feel
like there is any performance to be gained with it and it embeds too
much knowledge here.
No, no, no. Let's not reintroduce a fw inside irq context on a frequent
timer again.
Tasklet and hopefully watchdog timeouts are not frequent. :)
Rule of thumb for fw_get:
gen6+: 10us to 50ms.
gen8+: 10us to 500us.
And then we don't release fw for 1ms after the fw_put. So we basically
prevent GT powersaving while the watchdog is active. That strikes me as
hopefully an unintended consequence.
The fw_get will be required if we actually hang, but for the timer
check, we should be able to do without.
That would be nice, but it is needed by the watchdog disable/re-enable
logic. Which I commented looks suspect to me so maybe something can be
done about that.
But in general, I didn't quite get if you are opposed to my suggestion
not to open code knowledge of fw domains here in favour of simple
I915_WRITE, or just the whole concept of taking a fw by any means here.
And while on the topic of the timer irq, it should be forcibly cleared
along intel_engine_park, so that we ensure it is not raised while the
device/driver is supposed to be asleep. Or something to that effect.
I have raised the issue of syncing the potentially delayed tasklet, but
yeah, could be that more is needed.
+ current_seqno = intel_engine_get_seqno(engine);
+
+ /* did the request complete after the timer expired? */
+ if (intel_engine_last_submit(engine) == current_seqno)
+ goto fw_put;
+
+ if (engine->hangcheck.watchdog == current_seqno) {
+ /* Make sure the active request will be marked as guilty */
+ engine->hangcheck.stalled = true;
+ engine->hangcheck.acthd = intel_engine_get_active_head(engine);
+ engine->hangcheck.seqno = current_seqno;
+
+ /* And try to run the hangcheck_work as soon as possible */
+ set_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags);
+ queue_delayed_work(system_long_wq,
+ &dev_priv->gpu_error.hangcheck_work,
+ round_jiffies_up_relative(HZ));
+ } else {
+ engine->hangcheck.watchdog = current_seqno;
The logic above potentially handles my previous question? Could be if
batch 2 hangs. But..
Also, DO NOT USE HANGCHECK for this. The whole design was to be able to
do the engine reset right away. (Now guc can't but that's known broken.)
Aside, we have to rewrite this entire logic anyway as the engine seqno
and global_seqno are obsolete.
Btw one thing I forgot to say - I did not focus on the hangcheck
interactions - I'll leave that for people more in the know of that.
Regards,
Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx