On Tue, Oct 13, 2015 at 01:37:32PM +0200, Daniel Vetter wrote:
> On Fri, Oct 09, 2015 at 12:40:51PM +0100, Tomas Elf wrote:
> > On 09/10/2015 09:27, Daniel Vetter wrote:
> > > On Thu, Oct 08, 2015 at 07:31:34PM +0100, Tomas Elf wrote:
> > > > Using safe list iterators alleviates the problem of unsynchronized driver list
> > > > manipulations while error state capture is in the process of traversing lists.
> > > >
> > > > Signed-off-by: Tomas Elf <tomas.elf@xxxxxxxxx>
> > > > ---
> > > >  drivers/gpu/drm/i915/i915_gpu_error.c | 38 +++++++++++++++++------------------
> > > >  1 file changed, 19 insertions(+), 19 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> > > > index 2f04e4f..32c1799 100644
> > > > --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> > > > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> > > > @@ -718,10 +718,10 @@ static void capture_bo(struct drm_i915_error_buffer *err,
> > > >  static u32 capture_active_bo(struct drm_i915_error_buffer *err,
> > > >  			     int count, struct list_head *head)
> > > >  {
> > > > -	struct i915_vma *vma;
> > > > +	struct i915_vma *vma, *tmpvma;
> > > >  	int i = 0;
> > > >
> > > > -	list_for_each_entry(vma, head, mm_list) {
> > > > +	list_for_each_entry_safe(vma, tmpvma, head, mm_list) {
> > >
> > > _safe isn't safe against anything, it's only safe against removal of the
> > > current list item done with a list_del/list_move from this thread. I.e.
> > > the only thing it does is have a temporary pointer to the next list
> > > element so if you delete the current one it won't fall over.
> > >
> > > It doesn't protect against anything else, and especially not against other
> > > threads totally stomping the list.
> >
> > Absolutely! But it's good enough to make the test not fall over within an
> > hour of testing and instead allow it to run for 12+ hours during continuous
> > long-duration testing, which is what I need for the TDR validation.
>
> Well it looks really suspicious, since the only thing this protects
> against is deleting the element we look at right now. The only
> sensible theory I can come up with is that this slightly helps when
> starting the list walk, in case someone comes around and deletes the first
> element in the list from the retire worker. Since the retire worker tends
> to do more work per list item than we do, that's enough to make the race
> really unlikely. But it's still just duct-tape.
>
> > > So we need proper protection for these lists, independent of
> > > dev->struct_mutex. The least invasive way to do that is probably rcu,
> > > which only requires us to release the memory for any object sitting
> > > on these lists with kfree_rcu instead of normal kfree. Another option
> > > might be a spinlock, but that would be a lot more invasive, and Chris
> > > won't like the execbuf overhead this causes ;-)
> > > -Daniel
> >
> > Awesome! Let's go with that then.
>
> See our massive irc discussion ... it's more tricky :(
>
> In short rcu works perfectly if you only have the following lifetime
> sequence:
> - kmalloc object
> - add to list
> - remove from list (eventually)
> - free object
>
> If you re-add an object to a list, or even worse, move it, then the rcu
> list helpers won't work. What could happen on the read side is:
>
> - you walk the rcu list, keeping track of the head element to know when
>   to stop
> - while looking at that list one of the yet untraversed items gets moved
>   to a new list
>
> -> you'll traverse the new list forever since that one will never hit the
> head element.
>
> Not a problem for requests/vmas, but a problem for some of the object
> lists.
>
> Note that the problem isn't that we re-add the element (if that happens we
> might end up walking parts of the list twice or parts of it not at all),
> but moving it to a different list where there's a different head element.
>
> I haven't checked, but maybe we're lucky and all the object lists we're
> looking at will always have the same head element. Then I think it'll work
> out (albeit racy, but who cares about that for the error capture) and with
> the guarantee (when using rcu delayed freeing) that we'll never chase
> freed memory.

Ah, so you agree that stop_machine() is good enough.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx