Re: [PATCH 15/16] intel_l3_parity: Support error injection

Ben Widawsky <ben@xxxxxxxxxxxx> · Fri, 13 Sep 2013 09:29:56 -0700

On Fri, Sep 13, 2013 at 06:14:38PM +0200, Daniel Vetter wrote:
> On Fri, Sep 13, 2013 at 5:54 PM, Ben Widawsky <ben@xxxxxxxxxxxx> wrote:
> > On Fri, Sep 13, 2013 at 11:12:11AM +0200, Daniel Vetter wrote:
> >> On Thu, Sep 12, 2013 at 10:28:41PM -0700, Ben Widawsky wrote:
> >> > Haswell added the ability to inject errors, which is extremely useful for
> >> > testing. Add two arguments to the tool to inject and uninject errors.
> >> >
> >> > Signed-off-by: Ben Widawsky <ben@xxxxxxxxxxxx>
> >>
> >> Do we run any risk that a concurrent write/read to the same register range
> >> could hang the machine due to the same-cacheline w/a we need? Just want to
> >> make sure that when we integrate this into a testcase there's no surprises
> >> like with intel_gpu_top ...
> >> -Daniel
> >
> > The race against the kernel is ever-present in all tests/tools. Are we
> > running parallel igt yet? If so, I can make the read/write functions
> > threadsafe.
> >
> > On this note in particular I suppose we can make a debugfs entry like
> > the forcewake one to allow user space to do register accesses.
> >
> > Interestingly, this also reminds me of another caveat I meant to put in
> > the commit message and forgot... the error injection register is also
> > per context, which makes it a pain to clear (and is the painful part of
> > writing the test case). I'm even beginning to think a debugfs entry for this
> > register is the way to go.
> >
> > As a side note, the injection feature is entirely debug only - but
> > agreed, random hangs in the test suite are not good.
> 
> Hm, this will be tricky. If nothing else writes this range (i.e. not
> our interrupt handler) we could use a secure batchbuffer and emit the
> MI_LRI from the userspace batch. Then we could submit some workload
> using hw contexts that uses the l3$ cache (I guess without something
> in there it won't notice the injected error) and after the error is
> detected we could simply kill the context, restoring the original
> state again.
> -Daniel
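As a rough sketch of the secure-batch idea above, the C below fills a userspace batch with a single MI_LOAD_REGISTER_IMM (MI_LRI) followed by MI_BATCH_BUFFER_END. The MI_INSTR encoding follows the one i915 uses, but HYPOTHETICAL_L3_INJECT_REG is a placeholder and not the real Haswell error-injection register, and emit_lri_batch() is a hypothetical helper. Submitting the batch as a secure batch and running an L3-using workload in a hardware context are not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* MI command encoding as used by i915: opcode in bits 28:23, length/flags below. */
#define MI_INSTR(op, flags)     (((uint32_t)(op) << 23) | (uint32_t)(flags))
#define MI_LOAD_REGISTER_IMM(n) MI_INSTR(0x22, 2 * (n) - 1) /* n = reg/value pairs */
#define MI_BATCH_BUFFER_END     MI_INSTR(0x0a, 0)

/* HYPOTHETICAL placeholder: not the real Haswell L3 error-injection offset. */
#define HYPOTHETICAL_L3_INJECT_REG 0x0u

/* Hypothetical helper: writes "LRI reg <- value; BB_END" into batch.
 * Returns the number of dwords written (4, already an even count, so the
 * batch length stays qword aligned), or 0 if cap is too small. */
static size_t emit_lri_batch(uint32_t *batch, size_t cap, uint32_t reg, uint32_t value)
{
    if (cap < 4)
        return 0;
    batch[0] = MI_LOAD_REGISTER_IMM(1); /* one register/value pair */
    batch[1] = reg;                     /* MMIO offset to load */
    batch[2] = value;                   /* immediate value */
    batch[3] = MI_BATCH_BUFFER_END;
    return 4;
}
```

Because the injection state is per context, killing the hardware context after the error is detected would discard it, which is what makes this approach attractive compared to poking the register from the CPU.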

Actually, I don't think anything else in the cacheline of the error
injection register is accessed after driver load.
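A minimal sketch of the check being described, assuming 64-byte cachelines and that the tool knows which other register offsets are still live after driver load: two MMIO offsets collide when they fall in the same 64-byte-aligned block. The helper names and any offsets used with them are hypothetical. Note this only tells you whether a conflict is possible; it does nothing about the race against the kernel itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: register accesses are grouped into 64-byte cachelines. */
#define CACHELINE_BYTES 64u

/* True if MMIO offsets a and b fall in the same 64-byte-aligned block. */
static bool same_cacheline(uint32_t a, uint32_t b)
{
    return (a / CACHELINE_BYTES) == (b / CACHELINE_BYTES);
}

/* Hypothetical helper: true if target shares a cacheline with any offset
 * in live[], i.e. a concurrent access could touch the same line. */
static bool shares_cacheline_with(uint32_t target, const uint32_t *live, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (same_cacheline(target, live[i]))
            return true;
    }
    return false;
}
```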

-- 
Ben Widawsky, Intel Open Source Technology Center
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx