Quoting Chris Wilson (2019-09-18 17:10:26)
> Quoting Tvrtko Ursulin (2019-09-18 16:54:36)
> >
> > On 17/09/2019 16:17, Chris Wilson wrote:
> > > Quoting Tvrtko Ursulin (2019-09-17 15:59:25)
> > >>
> > >> On 16/09/2019 12:38, Chris Wilson wrote:
> > >>> When using virtual engines, the rq->engine is not stable until we hold
> > >>> the engine->active.lock (as the virtual engine may be exchanged with the
> > >>> sibling). Since commit 22b7a426bbe1 ("drm/i915/execlists: Preempt-to-busy")
> > >>> we may retire a request concurrently with resubmitting it to HW, we need
> > >>> to be extra careful to verify we are holding the correct lock for the
> > >>> request's active list. This is similar to the issue we saw with
> > >>> rescheduling the virtual requests, see sched_lock_engine().
> > >>>
> > >>> Or else:
> > >>>
> > >>> <4> [876.736126] list_add corruption. prev->next should be next (ffff8883f931a1f8), but was dead000000000100. (prev=ffff888361ffa610).
> ...
> > >>> <4> [876.736415] list_del corruption. prev->next should be ffff888361ffca10, but was ffff88840ac2c730
> > >
> > > Yes. So preempt-to-busy introduces a window where the request is still
> > > on HW but we have returned it back to the submission queue. We catch up
> > > with the HW on the next process_csb, but it may have completed the
> > > request in the mean time (it is just not allowed to advance beyond the
> > > subsequent breadcrumb and so prevented from overtaking our knowledge of
> > > RING_TAIL and so we avoid telling the HW to go "backwards".).
> >
> > Would it be sufficient to do:
> >
> > engine = READ_ONCE(rq->engine);
> > spin_lock(...);
> > list_del(...);
> > spin_unlock(engine->active.lock);
> >
> > To ensure the same engine is used? Although the oops is not about
> > spinlock but list corruption. How does the list get corrupt though?
> > list_del does not care on which list the request is.. If it is really
> > key to have the correct lock, then why it is enough to re-check the
> > engine after taking the lock? Why rq->engine couldn't change under the
> > lock again? rq->engine does get updated under the very lock, no?
>
> Don't forget that list_del changes the list around it:
> 	list_del() {
> 		list->prev->next = list->next;
> 		list->next->prev = list->prev;
> 	}
>
> rq->engine can't change under the real->active.lock, as the assignment
> to rq->engine = (virtual, real) is made under the real->active.lock.
>
> execlists_dequeue:
> 	real->active.lock
> 	ve->active.lock
>
> __unwind_incomplete_requests:
> 	real->active.lock
>
> Hmm. I trust the trick employed in the patch is well proven by this
> point, but if we took the nested ve lock inside __unwind, do we need to
> worry. Hmm.

No. Take an extreme example where the request is taken from the virtual
engine, submitted to sibling0, then transferred to sibling1, all between
the other thread doing

	engine = READ_ONCE(rq->engine);
	lock(engine);

The engine that used to be on the request no longer correlates with the
engine locking the sched.requests.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
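For reference, below is a minimal sketch of the lock-then-recheck trick the
thread alludes to ("the trick employed in the patch"): take the active.lock
of the engine currently published in rq->engine, then re-read rq->engine and
retry if it moved, since rq->engine can only be reassigned while the owning
engine's active.lock is held. This is a sketch based on the discussion, not
a verbatim copy of the patch; the function name remove_from_engine and the
exact field layout (engine->active.lock, rq->sched.link) are assumptions.

	static void remove_from_engine(struct i915_request *rq)
	{
		struct intel_engine_cs *engine, *locked;

		/*
		 * With virtual engines, rq->engine is not stable until we
		 * hold the lock of the engine that currently owns the
		 * request. Lock the engine we last saw, then confirm
		 * rq->engine did not move underneath us, retrying until
		 * the two agree.
		 */
		locked = READ_ONCE(rq->engine);
		spin_lock(&locked->active.lock);
		while ((engine = READ_ONCE(rq->engine)) != locked) {
			spin_unlock(&locked->active.lock);
			spin_lock(&engine->active.lock);
			locked = engine;
		}

		/* rq->engine == locked and cannot change until we unlock. */
		list_del(&rq->sched.link);

		spin_unlock(&locked->active.lock);
	}

The retry loop is what closes the race described above: simply sampling
rq->engine with READ_ONCE() before taking the lock is not enough, because
the request may migrate between siblings after the sample and before the
lock is acquired, leaving list_del() rewriting neighbours on a list that a
different engine's lock protects.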