On Fri, Oct 09, 2015 at 12:30:29PM +0100, Tomas Elf wrote: > On 09/10/2015 08:48, Chris Wilson wrote: > >On Thu, Oct 08, 2015 at 07:31:37PM +0100, Tomas Elf wrote: > >>Sometimes the iterated vma objects are NULL apparently. Be aware of this while > >>iterating and do early exit instead of causing a NULL pointer exception. > > > >Wrong. > >-Chris > > > > So the NULL pointer exception that I ran into multiple times during > several different test runs on the line that says "vma->pin_count" > was not because the vma pointer was NULL. Would you mind sharing > your explanation to how this might have happened? Is it because > we're not synchronizing and there's no protection against the driver > deallocating vmas while we're trying to capture them? If this all > ties into the aforementioned RCU-based solution then maybe we should > go with that then. Correct. The driver is retiring requests whilst the hang check worker is running. And you chased a stale pointer, you could have equally chased a stale vma->obj, vma->ctx etc pointers. What I have done in the past is to serialise the retirement and the hangcheck using a spinlock (as adding to the end of the list is safe), but there are still weak spots when walking the list of bound vma. What I would actually feel more comfortable with is to only record the request->batch_vma, at the cost of losing information about the currently bound buffers. Or we could just stop_machine() whilst running the capture and have the machine unresponsive for a few 100ms. I don't think simply RCU the lists is enough (VM active_list, request list, bound list) as eventually we chase a pointer from obj itself (which means we need to RCU pretty much everything). -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/intel-gfx