On Tue, Jun 16, 2015 at 02:54:49PM +0100, Chris Wilson wrote:
> On Tue, Jun 16, 2015 at 03:48:09PM +0200, Daniel Vetter wrote:
> > On Mon, Jun 08, 2015 at 06:33:59PM +0100, Chris Wilson wrote:
> > > On Mon, Jun 08, 2015 at 06:03:21PM +0100, Tomas Elf wrote:
> > > > In preparation for per-engine reset, add a way for setting context
> > > > reset stats.
> > > >
> > > > OPEN QUESTIONS:
> > > > 1. How do we deal with get_reset_stats and the GL robustness
> > > > interface when introducing per-engine resets?
> > > >
> > > >    a. Do we mark contexts that cause per-engine resets as guilty?
> > > >       If so, how does this affect context banning?
> > >
> > > Yes. If the reset works quicker, then we can set a higher threshold
> > > for DoS detection, but we still do need DoS detection, don't we?
> > >
> > > >    b. Do we extend the publicly available reset stats to also
> > > >       contain per-engine reset statistics? If so, would this break
> > > >       the ABI?
> > >
> > > No. get_reset_stats is targeted at the GL API and describes resets in
> > > terms of whether my context is guilty or has been affected. That is
> > > orthogonal to whether the reset was on a single ring or the entire
> > > GPU - the question is how broad we want the "affected" set to be.
> > > Ideally a per-context reset wouldn't necessarily impact others,
> > > except for the surfaces shared between them...
> >
> > GL computes sharing sets itself; the kernel only tells it whether a
> > given context has been victimized, i.e. one of its batches was not
> > properly executed due to a reset after a hang.
>
> So you don't think we should delete all pending requests that depend
> upon state from the hung request?

Tbh I haven't fully thought through what happens with partial resets.
Looking into the future with hardware faulting/SVM it's clear that
soonish the kernel won't even be in a position to know the dependencies.
And userspace already needs to take any kind of texture sharing into
account when computing certain arb_robustness values.

Given that, I'm leaning towards a lean implementation in the kernel that
only marks the actual victim batches/contexts and simply continues to
execute everything else. That carries some risk of ending up in continual
resets if a bit of corruption causes all follow-up batches to fail, but
that's something we need to be able to handle anyway (using a full-blown
reset where we throw away all the batches), eventually even escalating to
refusing GPU access to repeat offenders.

But this is definitely something we need to decide upon, and something
which needs to be carefully tested with nasty igts for all the corner
cases. And preferably also at least some basic multi-context testcases on
top of mesa/libva robustness.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
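
For what it's worth, a minimal sketch of the "mark the victims and keep
going" bookkeeping described above could look something like this; the
struct, field names and ban threshold are hypothetical illustrations of
the policy, not actual i915 code:

#include <stdbool.h>

/* Hypothetical per-context hang bookkeeping. */
struct ctx_hang_stats {
        unsigned int guilty_count;      /* hangs this context caused */
        unsigned int victim_count;      /* hangs it was merely caught in */
        bool banned;                    /* refuse further submissions */
};

#define BAN_THRESHOLD 5 /* hypothetical DoS cut-off */

static void mark_hang(struct ctx_hang_stats *stats, bool was_executing)
{
        if (was_executing) {
                /* The hung batch belonged to this context: guilty. */
                if (++stats->guilty_count >= BAN_THRESHOLD)
                        stats->banned = true;
        } else {
                /*
                 * Per the lean scheme above, pending work from other
                 * contexts is not cancelled; the victim is only counted
                 * so userspace can report an innocent reset.
                 */
                stats->victim_count++;
        }
}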
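
And for reference, the get_reset_stats interface the thread refers to is
queried from userspace roughly as follows (struct and ioctl per the i915
uapi and libdrm headers; the guilty/innocent mapping is an illustration
of the GL robustness usage, and the enum and helper name are made up):

#include <errno.h>
#include <xf86drm.h>
#include <i915_drm.h>

enum reset_status { NO_RESET, GUILTY_RESET, INNOCENT_RESET };

static int query_reset_status(int fd, __u32 ctx_id)
{
        struct drm_i915_reset_stats stats = { .ctx_id = ctx_id };

        if (drmIoctl(fd, DRM_IOCTL_I915_GET_RESET_STATS, &stats))
                return -errno;

        /* A batch from this context hung while running on the GPU. */
        if (stats.batch_active)
                return GUILTY_RESET;
        /* This context only lost batches queued behind someone's hang. */
        if (stats.batch_pending)
                return INNOCENT_RESET;
        return NO_RESET;
}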