On Tue, Nov 06, 2018 at 01:47:21PM +0100, Henrik Austad wrote:
> From: Xunlei Pang <xlpang@xxxxxxxxxx>
>
> On some of our systems, we notice this error popping up on occasion,
> completely hanging the system.
>
> [<ffffffc0000ee398>] enqueue_task_dl+0x1f0/0x420
> [<ffffffc0000d0f14>] activate_task+0x7c/0x90
> [<ffffffc0000edbdc>] push_dl_task+0x164/0x1c8
> [<ffffffc0000edc60>] push_dl_tasks+0x20/0x30
> [<ffffffc0000cc00c>] __balance_callback+0x44/0x68
> [<ffffffc000d2c018>] __schedule+0x6f0/0x728
> [<ffffffc000d2c278>] schedule+0x78/0x98
> [<ffffffc000d2e76c>] __rt_mutex_slowlock+0x9c/0x108
> [<ffffffc000d2e9d0>] rt_mutex_slowlock+0xd8/0x198
> [<ffffffc0000f7f28>] rt_mutex_timed_futex_lock+0x30/0x40
> [<ffffffc00012c1a8>] futex_lock_pi+0x200/0x3b0
> [<ffffffc00012cf84>] do_futex+0x1c4/0x550
>
> It runs a 4.4 kernel on an arm64 rig. The signature looks suspiciously
> similar to what Xunlei Pang observed in his crash, and with this fix, my
> issue goes away (my system has survived approx 1500 reboots and a few
> nasty tests so far).
>
> Alongside this patch in the tree, there are a few other bits and pieces
> pertaining to futex, rtmutex and kernel/sched/, but those patches create
> weird crashes that I have not been able to dissect yet. Once (if) I have
> been able to figure those out (and test), they will be sent later.
>
> I am sure other users of LTS that also use sched_deadline will run into
> this issue, so I think it is a good candidate for 4.4-stable. Possibly also
> to 4.9 and 4.14, but I have not had time to test for those versions.

But this patch relies on:

  2a1c60299406 ("rtmutex: Deboost before waking up the top waiter")

for pointer stability, but that patch in turn relies on the whole
FUTEX_UNLOCK_PI patch set:

  $ git log --oneline 499f5aca2cdd5e958b27e2655e7e7f82524f46b1..56222b212e8edb1cf51f5dd73ff645809b082b40

  56222b212e8e futex: Drop hb->lock before enqueueing on the rtmutex
  bebe5b514345 futex: Futex_unlock_pi() determinism
  cfafcd117da0 futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()
  38d589f2fd08 futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()
  50809358dd71 futex,rt_mutex: Introduce rt_mutex_init_waiter()
  16ffa12d7425 futex: Pull rt_mutex_futex_unlock() out from under hb->lock
  73d786bd043e futex: Rework inconsistent rt_mutex/futex_q state
  bf92cf3a5100 futex: Cleanup refcounting
  734009e96d19 futex: Change locking rules
  5293c2efda37 futex,rt_mutex: Provide futex specific rt_mutex API
  fffa954fb528 futex: Remove rt_mutex_deadlock_account_*()
  1b367ece0d7e futex: Use smp_store_release() in mark_wake_futex()

and possibly some follow-up fixes on that (I have vague memories of
that).

As is, just the one patch you propose isn't correct :/

Yes, that was a ginormous amount of work to fix a seemingly simple splat :-(