Quoting Linus Torvalds (2020-04-07 18:28:34)
> On Tue, Apr 7, 2020 at 9:04 AM Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > It'll take some time to reconstruct the original report, but the case
> > in question was in removing the last element of the last list,
> > switching to a global lock over all such lists to park the HW, which
> > in doing so added one more element to the original list. [If we happen
> > to be retiring along the kernel timeline in the first place.]
>
> Please point to the real code and the list.
>
> Honestly, what you describe sounds complex enough that I think your
> locking is simply just buggy.
>
> IOW, this patch seems to really just paper over a locking bug, and the
> KASAN report tied to it.
>
> Because the fundamental issue is that READ_ONCE() can not fix a bug
> here. Reading the next pointer once fundamentally cannot matter if it
> can change concurrently: the code is buggy, and the READ_ONCE() just
> means that it gets one or the other value randomly, and that the list
> walking is fundamentally racy.
>
> On the other hand, if the next pointer _cannot_ change concurrently,
> then READ_ONCE() cannot make a difference.
>
> So as far as I can tell, we have two possibilities, and in both cases
> changing the code to use READ_ONCE() is not the right thing to do. In
> one case it hides a bug, and in another case it's just pointless.
>
> > list->next changed from pointing to list_head, to point to the new
> > element instead. However, we don't have to check the next element yet
> > and want to terminate the list iteration.
>
> I'd really like to see the actual code that has that list walking. You say:
>
> > For reference,
> > drivers/gpu/drm/i915/gt/intel_engine_pm.c::__engine_park()
>
> .. but that function doesn't have any locking or list-walking. Is it
> the "call_idle_barriers()" loop? What is it?

list_for_each_safe @ intel_gt_requests.c::retire_requests
list_del @ i915_requests.c::i915_request_retire
list_add @ i915_requests.c::__i915_request_create

> I'd really like to see the KASAN report and the discussion about this change.
>
> And if there was no discussion, then the patch just seems like "I
> changed code randomly and the KASAN report went away, so it's all
> good".
>
> > Global activity is serialised by engine->wakeref.mutex; every active
> > timeline is required to hold an engine wakeref, but retiring is local
> > to timelines and serialised by their own timeline->mutex.
> >
> > lock(&timeline->lock)
> > list_for_each_safe(&timeline->requests)
> > \-> i915_request_retire [list_del(&timeline->requests)]
> >   \-> intel_timeline_exit
> >     \-> lock(&engine->wakeref.mutex)
> >       engine_park [list_add_tail(&engine->kernel_context->timeline->requests)]
>
> in that particular list_for_each_safe() thing, there's no possibility
> that the 'next' field would be reloaded, since the list_del() in the
> above will be something the compiler is aware of.
>
> So yes, the beginning list_for_each_safe() might load it twice (or a
> hundred times), but by the time that list_del() in
> i915_request_retire() has been called, if the compiler then reloads it
> afterwards, that would be a major compiler bug, since it's after that
> value could have been written in the local thread.

My understanding of how KCSAN works is that it installs a watchpoint
after a read and complains if the value changes within X us.

> So this doesn't explain it to me.
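To make the shape of that walk concrete, here is a heavily condensed,
purely illustrative sketch of the call chain above; the struct and
function names (tl, rq, eng, park, retire, retire_all) are stand-ins of
my own, not the actual i915 code:

/* Illustrative only: stand-ins for the i915 structures. */
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/slab.h>

struct tl {                             /* ~ intel_timeline */
        struct mutex mutex;
        struct list_head requests;
};

struct rq {                             /* ~ i915_request */
        struct list_head link;
};

struct eng {                            /* ~ intel_engine_cs */
        struct mutex wakeref_mutex;     /* ~ engine->wakeref.mutex */
        int wakeref;
        struct tl *kernel_tl;           /* ~ engine->kernel_context->timeline */
};

/* Runs with eng->wakeref_mutex held, once the wakeref count hits zero. */
static void park(struct eng *eng)
{
        struct rq *rq = kzalloc(sizeof(*rq), GFP_KERNEL); /* the parking request */

        if (rq) /* may grow the very list being walked in retire_all() */
                list_add_tail(&rq->link, &eng->kernel_tl->requests);
}

/* ~ i915_request_retire -> intel_timeline_exit -> __engine_park */
static void retire(struct eng *eng, struct rq *rq)
{
        list_del(&rq->link);
        kfree(rq);

        mutex_lock(&eng->wakeref_mutex);
        if (!--eng->wakeref)
                park(eng);              /* list_add_tail after the final list_del */
        mutex_unlock(&eng->wakeref_mutex);
}

/* ~ intel_gt_requests.c::retire_requests */
static void retire_all(struct eng *eng, struct tl *tl)
{
        struct rq *rq, *rn;

        mutex_lock(&tl->mutex);
        list_for_each_entry_safe(rq, rn, &tl->requests, link)
                retire(eng, rq);        /* when tl == eng->kernel_tl, see park() */
        mutex_unlock(&tl->mutex);
}

Retiring the last request on eng->kernel_tl makes park() append a new
element to the list being walked, but only after the final list_del():
the saved next pointer already points at the list head, so the
iteration terminates without inspecting the new element.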
>
> What it *sounds* like is that the "engine" lock that you do *not* hold
> initially, is not protecting some accessor to that list, so you have a
> race on the list at the time of that list_del().

We take an engine wakeref prior to creating a request, and release it
upon retiring the request [through the respective context/timelines,
across all the engines that might be used for that context]. When the
engine wakeref hits zero, that mutex is taken to prevent any other new
request being created, and under that mutex, while we know that the
engine->kernel_context->timeline is empty, we submit a request to
switch the GPU to point into our scratch context. That submission can
run concurrently with the list iteration, but only _after_ the final
list_del.

> And that race may be what KASAN is reporting, and what that patch is
> _hiding_ from KASAN - but not fixing.

Yes, it was sent as a means to silence KCSAN, with a query as to
whether or not it could happen in practice with an aggressive compiler.

> See what I am saying and why I find this patch questionable?
>
> There may be something really subtle going on, but it really smells
> like "two threads are modifying the same list at the same time".

In strict succession.

> And there's no way that the READ_ONCE() will fix that bug, it will
> only make KASAN shut up about it.

There's some more shutting up required for KCSAN to bring the noise
down to usable levels, which I hope has been done so I don't have to
argue for it, such as:

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 1e6650ed066d..c7c8dd89f279 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -164,7 +164,7 @@ static inline void destroy_timer_on_stack(struct timer_list *timer) { }
  */
 static inline int timer_pending(const struct timer_list * timer)
 {
-        return timer->entry.pprev != NULL;
+        return READ_ONCE(timer->entry.pprev) != NULL;
 }
 
 extern void add_timer_on(struct timer_list *timer, int cpu);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 5352ce50a97e..7461b3f33629 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -565,8 +565,9 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner,
                 /*
                  * Use vcpu_is_preempted to detect lock holder preemption issue.
                  */
-                if (!owner->on_cpu || need_resched() ||
-                    vcpu_is_preempted(task_cpu(owner))) {
+                if (!READ_ONCE(owner->on_cpu) ||
+                    need_resched() ||
+                    vcpu_is_preempted(task_cpu(owner))) {
                         ret = false;
                         break;
                 }
@@ -602,7 +603,7 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
          * on cpu or its cpu is preempted
          */
         if (owner)
-                retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
+                retval = READ_ONCE(owner->on_cpu) && !vcpu_is_preempted(task_cpu(owner));
         rcu_read_unlock();
 
         /*
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 1f7734949ac8..4a81fba4cf70 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -75,7 +75,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
                  * wait for either @lock to point to us, through its Step-B, or
                  * wait for a new @node->next from its Step-C.
                  */
-                if (node->next) {
+                if (READ_ONCE(node->next)) {
                         next = xchg(&node->next, NULL);
                         if (next)
                                 break;
@@ -154,7 +154,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
          */
 
         for (;;) {
-                if (prev->next == node &&
+                if (READ_ONCE(prev->next) == node &&
                     cmpxchg(&prev->next, node, NULL) == node)
                         break;
 
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 0d9b6be9ecc8..eef4835cecf2 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -650,7 +650,7 @@ static inline bool owner_on_cpu(struct task_struct *owner)
          * As lock holder preemption issue, we both skip spinning if
          * task is not on cpu or its cpu is preempted
          */
-        return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
+        return READ_ONCE(owner->on_cpu) && !vcpu_is_preempted(task_cpu(owner));
 }
 
 static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
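For reference, the general shape of the annotation those hunks apply,
shown as a minimal made-up example rather than code from any of the
files above ("widget" and its fields are invented for this sketch): a
value written under a lock but peeked at locklessly is marked so that
KCSAN treats the racy read as intentional.

#include <linux/compiler.h>     /* READ_ONCE / WRITE_ONCE */
#include <linux/spinlock.h>
#include <linux/types.h>

struct widget {
        spinlock_t lock;
        bool busy;
};

static void widget_set_busy(struct widget *w, bool busy)
{
        spin_lock(&w->lock);
        WRITE_ONCE(w->busy, busy);      /* pairs with the lockless read below */
        spin_unlock(&w->lock);
}

/* Lockless peek, cf. timer_pending() above: a hint, not a guarantee. */
static bool widget_busy(const struct widget *w)
{
        return READ_ONCE(w->busy);
}

The hunks above annotate only the read side; the WRITE_ONCE() in the
sketch is just to make the pairing explicit.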