On Fri, Feb 19, 2016 at 12:08:14PM +0000, Tvrtko Ursulin wrote:
> 
> Hi,
> 
> On 11/01/16 10:44, Chris Wilson wrote:
> >[ 196.988204] clocksource: timekeeping watchdog: Marking clocksource 'tsc' as unstable because the skew is too large:
> >[ 196.988512] clocksource: 'refined-jiffies' wd_now: ffff9b48 wd_last: ffff9acb mask: ffffffff
> >[ 196.988559] clocksource: 'tsc' cs_now: 4fcfa84354 cs_last: 4f95425e98 mask: ffffffffffffffff
> >[ 196.992115] clocksource: Switched to clocksource refined-jiffies
> >
> >Followed by a hard lockup.
> 
> What does the above mean? Irq handler taking too long interferes
> with time keeping?

That's exactly what it means: we run for too long in interrupt context
(i.e. with interrupts disabled).

> I like it BTW. Just the commit message needs more work. :)
> 
> How is the performance impact with just this patch in isolation?
> Worth progressing it on its own?

I only looked for regressions, which I didn't find. It fixes a machine
freeze/panic, so I wasn't looking for any other reason to justify the
patch!

> >-void intel_lrc_irq_handler(struct intel_engine_cs *ring)
> >+static int intel_execlists_submit(void *arg)
> > {
> >-	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> >-	u32 status_pointer;
> >-	u8 read_pointer;
> >-	u8 write_pointer;
> >-	u32 status = 0;
> >-	u32 status_id;
> >-	u32 submit_contexts = 0;
> >+	struct intel_engine_cs *ring = arg;
> >+	struct drm_i915_private *dev_priv = ring->i915;
> >
> >-	status_pointer = I915_READ(RING_CONTEXT_STATUS_PTR(ring));
> >+	set_rtpriority();
> >
> >-	read_pointer = ring->next_context_status_buffer;
> >-	write_pointer = GEN8_CSB_WRITE_PTR(status_pointer);
> >-	if (read_pointer > write_pointer)
> >-		write_pointer += GEN8_CSB_ENTRIES;
> >+	do {
> >+		u32 status;
> >+		u32 status_id;
> >+		u32 submit_contexts;
> >+		u8 head, tail;
> >
> >-	spin_lock(&ring->execlist_lock);
> >+		set_current_state(TASK_INTERRUPTIBLE);
> 
> Hm, what is the effect of setting TASK_INTERRUPTIBLE at this stage
> rather than just before the call to schedule?

We need to set the task state before doing the checks, so that we do
not miss a wakeup from the interrupt handler. Otherwise we may check
the register, decide there is nothing to do, be interrupted by the irq
handler which then sets us to TASK_RUNNING, and then proceed with going
to sleep by setting TASK_INTERRUPTIBLE (thereby missing that the
context-switch had just occurred and sleeping forever).

> And why interruptible?

To hide ourselves from contributing to the system load and appear as
sleeping (rather than blocked) in the process lists.

> >+		while (head++ < tail) {
> >+			status = I915_READ(RING_CONTEXT_STATUS_BUF_LO(ring, head % GEN8_CSB_ENTRIES));
> >+			status_id = I915_READ(RING_CONTEXT_STATUS_BUF_HI(ring, head % GEN8_CSB_ENTRIES));
> 
> One potentially cheap improvement I've been thinking of is to move
> the CSB reading outside the execlist_lock, to avoid slow MMIO
> contending the lock.

Yes, that should be a small but noticeable improvement.

> We could fetch all valid entries into a temporary local buffer, then
> grab the execlist_lock and process completions and submission from
> there.

If you look at the next patch, you will see that's what I did do :)
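Putting the two together, setting the task state before the checks and
keeping the slow CSB reads outside the lock, the thread loop ends up
looking roughly like this. Note this is an illustrative sketch rather
than the actual patch (set_rtpriority() and the rest of the plumbing
are omitted), and process_csb_entry() is a hypothetical stand-in for
the real completion handling and ELSP submission:

static int intel_execlists_submit(void *arg)
{
	struct intel_engine_cs *ring = arg;
	struct drm_i915_private *dev_priv = ring->i915;

	do {
		u32 csb[GEN8_CSB_ENTRIES][2];
		u8 head, tail;
		int i, count;

		/* Set the task state _before_ sampling the write pointer:
		 * if the irq handler fires in between and wakes us, we are
		 * put back to TASK_RUNNING and the schedule() below returns
		 * immediately instead of sleeping on a missed wakeup.
		 */
		set_current_state(TASK_INTERRUPTIBLE);

		head = ring->next_context_status_buffer;
		tail = GEN8_CSB_WRITE_PTR(I915_READ(RING_CONTEXT_STATUS_PTR(ring)));
		if (head == tail) {
			schedule();
			continue;
		}
		__set_current_state(TASK_RUNNING);

		if (head > tail)
			tail += GEN8_CSB_ENTRIES;

		/* The slow MMIO reads of the CSB entries go into a local
		 * buffer first...
		 */
		count = 0;
		while (head++ < tail) {
			csb[count][0] = I915_READ(RING_CONTEXT_STATUS_BUF_LO(ring, head % GEN8_CSB_ENTRIES));
			csb[count][1] = I915_READ(RING_CONTEXT_STATUS_BUF_HI(ring, head % GEN8_CSB_ENTRIES));
			count++;
		}

		/* ...so that the execlist_lock is only held whilst
		 * processing completions and submitting the next contexts,
		 * not for the MMIO itself. process_csb_entry() here is a
		 * placeholder for that work.
		 */
		spin_lock(&ring->execlist_lock);
		for (i = 0; i < count; i++)
			process_csb_entry(ring, csb[i][0], csb[i][1]);
		spin_unlock(&ring->execlist_lock);

		ring->next_context_status_buffer = tail % GEN8_CSB_ENTRIES;
		I915_WRITE(RING_CONTEXT_STATUS_PTR(ring),
			   _MASKED_FIELD(GEN8_CSB_PTR_MASK << 8,
					 ring->next_context_status_buffer << 8));
	} while (!kthread_should_stop());

	return 0;
}

With that split, the interrupt handler is reduced to little more than
acking the interrupt and calling wake_up_process() on the thread.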
> >-	ring->next_context_status_buffer = write_pointer % GEN8_CSB_ENTRIES;
> >+		spin_unlock(&ring->execlist_lock);
> 
> Would it be worth trying to grab the mutex and unpin the LRCs at
> this point? It would slow down the thread a bit but would get rid of
> the retired work queue. With the vma->iomap thingy it could be quite
> cheap?

Exactly, we want the iomap/vmap caching thingy first :) But the retired
work queue disappears as a fallout of your previous-context idea anyway,
plus the fix to avoid the struct_mutex when freeing requests.

> >-	/* Update the read pointer to the old write pointer. Manual ringbuffer
> >-	 * management ftw </sarcasm> */
> >-	I915_WRITE(RING_CONTEXT_STATUS_PTR(ring),
> >-		   _MASKED_FIELD(GEN8_CSB_READ_PTR_MASK,
> >-				 ring->next_context_status_buffer << 8));
> >+		WARN(submit_contexts > 2, "More than two context complete events?\n");
> >+		ring->next_context_status_buffer = tail % GEN8_CSB_ENTRIES;
> >+		I915_WRITE(RING_CONTEXT_STATUS_PTR(ring),
> >+			   _MASKED_FIELD(GEN8_CSB_PTR_MASK << 8,
> >+					 ring->next_context_status_buffer << 8));
> >+	} while (1);
> > }
> >
> > static int execlists_context_queue(struct drm_i915_gem_request *request)
> >@@ -585,7 +582,7 @@ static int execlists_context_queue(struct drm_i915_gem_request *request)
> >
> > 	i915_gem_request_get(request);
> >
> >-	spin_lock_irq(&engine->execlist_lock);
> >+	spin_lock(&engine->execlist_lock);
> >
> > 	list_for_each_entry(cursor, &engine->execlist_queue, execlist_link)
> > 		if (++num_elements > 2)
> >@@ -611,7 +608,7 @@ static int execlists_context_queue(struct drm_i915_gem_request *request)
> > 	if (num_elements == 0)
> > 		execlists_context_unqueue(engine);
> >
> >-	spin_unlock_irq(&engine->execlist_lock);
> >+	spin_unlock(&engine->execlist_lock);
> 
> Hm, another thing which could be quite cool is if here we could just
> wake the thread and let it submit the request, instead of doing it
> directly from 3rd party context.

Yes, this is something I played around with, and my conclusion was that
the extra overhead from calling ttwu (try_to_wake_up) on the majority
of workloads outweighed the few workloads that benefitted.

> That would make all ELSP accesses serialized for free since they
> would only be happening from a single thread. And potentially could
> reduce the scope of the lock even more.
> 
> But it would mean extra latency when transitioning the idle engine
> to busy. Maybe it wouldn't matter for such workloads.

Yup. I saw greater improvement from reducing the overhead along the
execlists context-switch handling path than from the adding of
requests. There is certainly plenty of scope for improvement though.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre