On Mon, 18 Jul 2022 17:27:28 -0700 Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
>
> I've been tracking down an occasional hang at reboot on my system and
> I've ended up at this as the first bad commit. I will not pretend to
> understand the intricacies of the rwsem implementation, but I can
> describe what I saw. I have also produced a fairly small test case
> that reproduces the problem rather quickly.
>
> First, what I saw:
>
> My system failed to fully boot up and eventually the "hung task"
> detection kicked in. Many tasks in my system were hung, all waiting
> on the "kernfs_rwsem". No tasks actually had the semaphore--it only
> had tasks waiting.
>
> Of the tasks waiting, 3 of them were doing a down_write(). The rest
> were all waiting on down_read().
>
> 2 of the tasks waiting on the down_write() were locked to CPU0. One
> of these tasks was a bound kworker. The other was a threaded IRQ
> handler. The threaded IRQ handler was set to "real time" priority
> because setup_irq_thread() calls sched_set_fifo().
>
> At the time the hung task detector kicked in, the real-time task was
> actually active on a CPU. Specifically, it was running in the for (;;)
> loop in rwsem_down_write_slowpath(). rwsem_try_write_lock() had
> clearly just returned false, which meant we didn't get the lock.
> Everything else was sitting in schedule().
>
> I managed to get the real-time task into kgdb and I could analyze its
> state as well as the state of "sem". The real-time task was _not_ the
> first waiter; the kworker was the first waiter. The
> "waiter.handoff_set" was set to "true" for the real-time task, and
> the rwsem owner was OWNER_NULL.
>
> Looking through the code and watching what was happening, the cycle
> was:
>
> 1. rwsem_try_write_lock() was instantly returning false since
> `handoff` is set and we're not first.
>
> 2. Back in rwsem_down_write_slowpath() we see the handoff set and we
> try to spin on the owner. There is no owner, so this is a no-op.
>
> 3. Since there's no owner, we go right back to the start of the loop.
>
> So basically the real-time thread (the threaded IRQ handler) was
> locked to CPU0 and spinning as fast as possible. The "first waiter"
> for the semaphore was blocked from running because it could only run
> on CPU0 but did _not_ have real-time priority.
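To make the livelock concrete, the wait loop you describe has roughly
this shape (a condensed, annotated sketch of the mainline
rwsem_down_write_slowpath() wait loop with details trimmed; treat it
as illustration, not an exact excerpt):

        for (;;) {
                if (rwsem_try_write_lock(sem, &waiter))
                        break;  /* step 1: fails instantly because the
                                 * handoff bit is set and we're not the
                                 * first waiter */

                raw_spin_unlock_irq(&sem->wait_lock);

                if (waiter.handoff_set) {       /* true for the RT thread */
                        enum owner_state owner_state;

                        preempt_disable();
                        owner_state = rwsem_spin_on_owner(sem);
                        preempt_enable();

                        /* steps 2+3: owner is OWNER_NULL, so spinning
                         * is a no-op and we retry without sleeping */
                        if (owner_state == OWNER_NULL)
                                goto trylock_again;
                }

                schedule();     /* the spinning FIFO thread never gets here */
trylock_again:
                raw_spin_lock_irq(&sem->wait_lock);
        }

Because the first waiter is pinned to CPU0 at normal priority, the
FIFO thread monopolizes CPU0 and this retry loop never makes progress.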
>
> -
>
> So all the analysis above was done on the Chrome OS 5.15 kernel
> branch, which has ${SUBJECT} patch from the stable tree. The code
> looks reasonably the same on mainline.
>
> ...and also, I coded up a test case that can reproduce this on
> mainline. It's ugly/hacky but it gets the job done. This reproduces
> the problem at the top of mainline as of commit 80e19f34c288 ("Merge
> tag 'hte/for-5.19' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux").
>
> For me, I was only able to reproduce this without "lockdep" enabled.
> My lockdep configs were:
>
> CONFIG_DEBUG_RT_MUTEXES=y
> CONFIG_DEBUG_SPINLOCK=y
> CONFIG_DEBUG_MUTEXES=y
> CONFIG_PROVE_RCU=y
> CONFIG_PROVE_LOCKING=y
> CONFIG_DEBUG_ATOMIC_SLEEP=y
>
> I don't know for sure whether disabling lockdep is actually required
> to reproduce.
>
> -
>
> OK, so here's my hacky test case. In my case, I put a call to this
> test function in a convenient debugfs "show" function to make it
> easy to trigger. You can put it wherever.
>
> struct test_data {
>         struct rw_semaphore *rwsem;
>         int i;
>         bool should_sleep;
> };
>
> static int test_thread_fn(void *data)
> {
>         struct test_data *test_data = data;
>         struct rw_semaphore *rwsem = test_data->rwsem;
>         ktime_t start;
>
>         trace_printk("Starting\n");
>         start = ktime_get();
>         while (ktime_to_ms(ktime_sub(ktime_get(), start)) < 60000) {
>                 trace_printk("About to grab\n");
>                 down_write(rwsem);
>                 trace_printk("Grabbed write %d\n", test_data->i);
>                 schedule();
>                 up_write(rwsem);
>                 trace_printk("Released write %d\n", test_data->i);
>                 if (test_data->should_sleep)
>                         msleep(1);
>         }
>         trace_printk("Done\n");
>
>         return 0;
> }
>
> static void test(void)
> {
>         static struct task_struct *t[10];
>         static struct test_data test_data[10];
>         static DECLARE_RWSEM(rwsem);
>         int i;
>
>         trace_printk("About to create threads\n");
>
>         for (i = 0; i < ARRAY_SIZE(t); i++) {
>                 test_data[i].rwsem = &rwsem;
>                 test_data[i].i = i;
>
>                 if (i == ARRAY_SIZE(t) - 1) {
>                         /*
>                          * Last thread will be bound to CPU0 and
>                          * realtime. Have it sleep to give other
>                          * threads a chance to run and contend.
>                          */
>                         test_data[i].should_sleep = true;
>                         t[i] = kthread_create_on_cpu(test_thread_fn,
>                                                      &test_data[i], 0,
>                                                      "test0 FIFO-%u");
>                         sched_set_fifo(t[i]);
>                 } else if (i == ARRAY_SIZE(t) - 2) {
>                         /* 2nd to last thread will be bound to CPU0 */
>                         t[i] = kthread_create_on_cpu(test_thread_fn,
>                                                      &test_data[i], 0,
>                                                      "test0-%u");
>                 } else {
>                         /* All other threads are just normal */
>                         t[i] = kthread_create(test_thread_fn,
>                                               &test_data[i], "test");
>                 }
>                 wake_up_process(t[i]);
>                 msleep(10);
>         }
> }
>
> -
>
> With the reproducer above, I was able to:
>
> 1. Validate that on chromeos-5.15 I could revert ${SUBJECT} patch and
> the problem went away.
>
> 2. Go to mainline at exactly the commit hash of ${SUBJECT} patch, see
> the problem, then revert ${SUBJECT} patch and see the problem go
> away.
>
> Thus I'm fairly confident that the problem is related to ${SUBJECT}
> patch.
>
> -
>
> I'm hoping that someone on this thread can propose a fix. I'm happy
> to test, but I was hoping not to have to become an expert on the
> rwsem implementation to try to figure out the proper fix.

See if it makes sense to only allow the first waiter to spin on owner.

Hillf

--- mainline/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -337,7 +337,7 @@ struct rwsem_waiter {
         unsigned long timeout;
 
         /* Writer only, not initialized in reader */
-        bool handoff_set;
+        bool handoff_set, first;
 };
 #define rwsem_first_waiter(sem) \
         list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
@@ -604,6 +604,7 @@ static inline bool rwsem_try_write_lock(
 
         lockdep_assert_held(&sem->wait_lock);
 
+        waiter->first = first;
         count = atomic_long_read(&sem->count);
         do {
                 bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
@@ -1114,6 +1115,7 @@ rwsem_down_write_slowpath(struct rw_sema
         waiter.type = RWSEM_WAITING_FOR_WRITE;
         waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
         waiter.handoff_set = false;
+        waiter.first = false;
 
         raw_spin_lock_irq(&sem->wait_lock);
         rwsem_add_waiter(sem, &waiter);
@@ -1158,7 +1160,7 @@ rwsem_down_write_slowpath(struct rw_sema
          * In this case, we attempt to acquire the lock again
          * without sleeping.
          */
-        if (waiter.handoff_set) {
+        if (waiter.handoff_set && waiter.first) {
                 enum owner_state owner_state;
 
                 preempt_disable();
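With this change, rwsem_try_write_lock() records whether the waiter
was at the head of the wait list when it last tried the lock, and the
slowpath takes the no-sleep retry only for that first waiter. Any
other writer that loses the trylock race falls through to schedule()
instead of busy-retrying, so the CPU0-bound first waiter can run and
take the handoff. Roughly, the patched loop body would behave like
this (a simplified sketch, not the literal patched source):

        if (waiter.handoff_set && waiter.first) {
                /* Only the head-of-queue writer spins and retries. */
                enum owner_state owner_state;

                preempt_disable();
                owner_state = rwsem_spin_on_owner(sem);
                preempt_enable();

                if (owner_state == OWNER_NULL)
                        goto trylock_again;
        }

        /*
         * A non-first waiter (e.g. the spinning RT irq thread above)
         * now sleeps here, breaking the livelock.
         */
        schedule();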