On 16/06/15 14:54, Chris Wilson wrote:
> On Tue, Jun 16, 2015 at 03:48:09PM +0200, Daniel Vetter wrote:
>> On Mon, Jun 08, 2015 at 06:33:59PM +0100, Chris Wilson wrote:
>>> On Mon, Jun 08, 2015 at 06:03:21PM +0100, Tomas Elf wrote:
>>>> In preparation for per-engine reset, add a way of setting context
>>>> reset stats.
>>>>
>>>> OPEN QUESTIONS:
>>>> 1. How do we deal with get_reset_stats and the GL robustness
>>>>    interface when introducing per-engine resets?
>>>>
>>>>    a. Do we set contexts that cause per-engine resets as guilty?
>>>>       If so, how does this affect context banning?
>>>
>>> Yes. If the reset works quicker, then we can set a higher threshold
>>> for DoS detection, but we still do need DoS detection?
>>>
>>>>    b. Do we extend the publicly available reset stats to also
>>>>       contain per-engine reset statistics? If so, would this break
>>>>       the ABI?
>>>
>>> No. The get_reset_stats is targeted at the GL API and describes the
>>> reset in terms of whether my context is guilty or has been affected.
>>> That is orthogonal to whether the reset was on a single ring or the
>>> entire GPU - the question is how broad we want the "affected" set to
>>> be. Ideally a per-context reset wouldn't necessarily impact others,
>>> except for the surfaces shared between them...
>>
>> GL computes sharing sets itself; the kernel only tells it whether a
>> given context has been victimized, i.e. one of its batches was not
>> properly executed due to a reset after a hang.
>
> So you don't think we should delete all pending requests that depend
> upon state from the hung request?
> -Chris

John Harrison & I discussed this yesterday; he's against doing so (even
though the scheduler is ideally placed to do it, if that were actually
the preferred policy). The primary argument (as I see it) is that you
don't and can't know the nature of an apparent dependency between
batches that share a buffer object. There are at least three cases:

1. "tightly-coupled": the dependent batch relies on data produced by
   the earlier batch. In this case, GIGO applies and the results will
   be undefined, possibly including a further hang. Subsequent batches
   presumably belong to the same or a closely-related (co-operating)
   task, and killing them might be a reasonable strategy here.

2. "loosely-coupled": the dependent batch accesses the data, but not in
   any way that depends on the content (for example, blitting a
   rectangle into a composition buffer). The result will be wrong, but
   only in a limited way (e.g. the window belonging to the faulty
   application will appear corrupted). The dependent batches may well
   belong to unrelated system tasks (e.g. X or surfaceflinger), and
   killing them is probably not justified.

3. "uncoupled": the dependent batch wants the /buffer/, not the data in
   it (most likely a framebuffer or similar object). Any incorrect data
   in the buffer is irrelevant, and killing off subsequent batches
   would be wrong.

Buffer access mode (readonly, read/write, writeonly) might allow us to
distinguish these somewhat, but probably not enough to help make the
right decision. So the default must be *not* to kill off dependants
automatically; but if the failure does propagate in such a way as to
cause further consequent hangs, then the context-banning mechanism
should eventually catch and block all the downstream effects.

.Dave.

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx