On Tue, 2016-09-20 at 09:29 +0100, Chris Wilson wrote:
> The error state is purposefully racy as we expect it to be called at any
> time and so have avoided any locking whilst capturing the crash dump.
> However, with multi-engine GPUs and multiple CPUs, those races can
> manifest as OOPSes as we attempt to chase dangling pointers freed on
> other CPUs. Under discussion are lots of ways to slow down normal
> operation in order to protect the post-mortem error capture, but what if
> we take the opposite approach and freeze the machine whilst the error
> capture runs (note the GPU may still be running, but as long as we don't
> process any of the results the driver's bookkeeping will be static)?
>
> Note that by itself, this is not a complete fix. It also depends on
> the compiler barriers in list_add/list_del to prevent traversing the
> lists into the void. We also depend on requiring state only from
> carefully controlled sources - i.e. all the state we require for
> post-mortem debugging should be reachable from the request itself so
> that we only have to worry about retrieving the request carefully. Once
> we have the request, we know that all pointers from it are intact.
>
> v2: Avoid drm_clflush_pages() inside stop_machine() as it may use
> stop_machine() itself for its wbinvd fallback.
>

I'm rather sure I already reviewed this code-wise, pending an A-b from
Daniel.

Regards, Joonas
-- 
Joonas Lahtinen
Open Source Technology Center
Intel Corporation
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx