On Fri, Aug 05, 2016 at 08:01:58PM +0100, Chris Wilson wrote: > On Fri, Aug 05, 2016 at 08:50:11PM +0200, Daniel Vetter wrote: > > On Fri, Aug 05, 2016 at 10:05:53AM +0100, Chris Wilson wrote: > > > The error state is purposefully racy as we expect it to be called at any > > > time and so have avoided any locking whilst capturing the crash dump. > > > However, with multi-engine GPUs and multiple CPUs, those races can > > > manifest into OOPSes as we attempt to chase dangling pointers freed on > > > other CPUs. Under discussion are lots of ways to slow down normal > > > operation in order to protect the post-mortem error capture, but what it > > > we take the opposite approach and freeze the machine whilst the error > > > capture runs (note the GPU may still running, but as long as we don't > > > process any of the results the driver's bookkeeping will be static). > > > > > > Note that by of itself, this is not a complete fix. It also depends on > > > the compiler barriers in list_add/list_del to prevent traversing the > > > lists into the void. > > > > The other important bit I think are NULL checks. I think the commit > > message should mention that too. > > Not so convinced here. My idea for GPU error capture is that we > basically grab the bad request and then probe from there. The only list > (and pointer chasing) I want to walk in a dangerous manner is the request > list. Everything we need to report should be derivable from that > request, and as the request is reachable from the list we know it is > intact. I guess we could err on !request->fence.refcount and run away. Yeah, this story is more convincing, would be good to add it. -Daniel > > > > v2: Avoid drm_clflush_pages() inside stop_machine() as it may use > > > stop_machine() itself for its wbinvd fallback. > > > > Droppingt he clflush from error capture seems like a pretty big > > regression, at least at a glance. Why does this not result in piles of > > corrupted error state captures? > > We replace the clflush with an ever bigger complete cache flush. > > However... > > > I guess since we need to flush on incoherent platforms before gpu access > > anyway this is mostly true (as long as userspace doesn't do something > > silly), but I think it should be captured in a comment at the place of the > > clflush. > > Indeed. All the data we read has been flushed out of the cpu caches. > If not, then right there is a reason for a hang. It's moot as I > propose just using the GTT for accessing every page since it is the only > universal method we have - and it has been my preference for GPU capture > because it gives us coherent data as the GPU would see it. We only > stopped doing so more recently when the batches were not accessible from > the GTT. > -Chris > > -- > Chris Wilson, Intel Open Source Technology Centre -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx