Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes: > Quoting Mika Kuoppala (2020-03-31 08:50:54) >> Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes: >> >> > process_csb: 0000:00:02.0 bcs0: cs-irq head=4, tail=5 >> > process_csb: 0000:00:02.0 bcs0: csb[5]: status=0x00008002:0x60000020 >> > trace_ports: 0000:00:02.0 bcs0: preempted { ff84:45154! prio 2 } >> > trace_ports: 0000:00:02.0 bcs0: promote { ff84:45155* prio 2 } >> > trace_ports: 0000:00:02.0 bcs0: submit { ff84:45156 prio 2 } >> > >> > process_csb: 0000:00:02.0 bcs0: cs-irq head=5, tail=6 >> > process_csb: 0000:00:02.0 bcs0: csb[6]: status=0x00000018:0x60000020 >> > trace_ports: 0000:00:02.0 bcs0: completed { ff84:45155* prio 2 } >> > process_csb: 0000:00:02.0 bcs0: ring:{start:0x00178000, head:0928, tail:0928, ctl:00000000, mode:00000200} >> > process_csb: 0000:00:02.0 bcs0: rq:{start:00178000, head:08b0, tail:08f0, seqno:ff84:45155, hwsp:45156}, >> > process_csb: 0000:00:02.0 bcs0: ctx:{start:00178000, head:e000928, tail:0928}, >> > process_csb: GEM_BUG_ON("context completed before request") >> > >> > In this sequence, we can see that although we have submitted the next >> > request [ff84:45156] to HW (via ELSP[]) it has not yet reported the >> > lite-restore. Instead, we see the completion event of the currently >> > active request [ff85:45155] but at the time of processing that event, >> > the breadcrumb has not yet been written. Though by the time we do print >> > out the debug info, the seqno write of ff85:45156 has landed! >> >> I see it. >> >> But I have now a more generic confusion about the sequence: >> why would the same context preempt itself? > > lite-restore continuations. > >> >> > trace_ports: 0000:00:02.0 bcs0: preempted { ff84:45154! prio 2 } >> > trace_ports: 0000:00:02.0 bcs0: promote { ff84:45155* prio 2 } >> >> > >> > Therefore there is a serialisation problem between the seqno writes and >> > CS events, not just between the CS buffer and its head/tail pointers as >> > previously observed on Icelake. >> > >> > This is not a huge problem, as we don't strictly rely on the breadcrumb >> > to determine HW activity, but it may indicate that interrupt delivery is >> > before the seqno write, aka bringing back the plague of missed >> > interrupts from yesteryear. However, there is no indication of this >> > wider problem, so let's just flush the seqno read before reporting an >> > error. If it persists after the fresh read we can worry again. >> >> Well you are preempting part of my concerns. >> >> Still you will have a serialision point now only with debug builds. > > We only check in debug. > >> So I assume the rest of the request completion flow is ambivalent of >> intr seqno coherency and we can cope with this is only for debug... > > Right. > >> Being in generic read hwsp path would be expensive and thus overkill? > > Very expensive. There's no evidence that the GPU is that broken that it > returns garbage in a *cached* CPU read. > >> Could we lift the flushing to the non show_debug path to be always >> part of process_csb cycle? > > No. What would be the point? The breadcrumb is not part of the CS flow. > The debug is our sanitycheck that the two events are ordered, but that > is not a strict requirement of the code. If it fails this test, the > presumption is that either the GPU is very dead (e.g. the GO problem, > RING_HEAD/TAIL issues, a variety of other means of killing the gpu) and > the problem will remain, or that we have a programming error, which > again will not resolve itself as we print the debug. Agreed that there is no point if it is for forcing a fresh read only for debug. Concern was that it affects the validity for other reads. But you amended that on the commit message, preemptively. So, Reviewed-by: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx> _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx