Re: Q: Possible reason(s) for BUG in rt_spin_lock_slowlock_locked()

Andreas Glatz <andi.glatz@xxxxxxxxx> · Fri, 19 Nov 2021 13:45:49 +0000

On Fri, Nov 19, 2021 at 12:19 PM Andreas Glatz <andi.glatz@xxxxxxxxx> wrote:
>
> On Fri, Nov 19, 2021 at 11:47 AM Sebastian Andrzej Siewior
> <bigeasy@xxxxxxxxxxxxx> wrote:
> >
> > On 2021-11-19 10:39:20 [+0000], Andreas Glatz wrote:
> > > Hi
> > Hi,
> >
> > > I patched 4.19.100 with rt41 patch set and we ported the Micrel
> > > ksz8462_h Ethernet driver. The driver has one threaded IRQ triggered
> > > by the interrupt from the ksz8462 chip and two workers, one for
> > > gathering MIBs and one for checking the link status. Everything
> > > seemingly ran ok for quite some time. However, yesterday I noticed
> > > that the IRQ thread died in rt_spin_lock_slowlock_locked() as per
> > > stacktrace below at:
> > >
> > > 0xc0af82f8 is in rt_spin_lock_slowlock_locked
> > > (/usr/src/kernel/kernel/locking/rtmutex.c:1105).
> > > 1100 * unconditionally. We might have to fix that up:
> > > 1101 */
> > > 1102 fixup_rt_mutex_waiters(lock);
> > > 1103
> > > 1104 BUG_ON(rt_mutex_has_waiters(lock) && waiter == rt_mutex_top_waiter(lock));
> > > 1105 BUG_ON(!RB_EMPTY_NODE(&waiter->tree_entry));
> > > 1106 }
> > > 1107
> > > 1108 static void noinline __sched rt_spin_lock_slowlock(struct rt_mutex *lock)
> > > 1109 {
> > >
> > > In the JTAG debugger I see that, at the same time, the other two
> > > kthreads are waiting on two of the four CPU cores of the i.MX6q for
> > > the spinlock held by the IRQ thread that died.
> > >
> > > Any ideas what might cause this and how to fix it?
> >
> > So the lock owner exploded in BUG_ON() and every lock attempt will fail
> > since the slow-path is forced and the wait_lock is still acquired.
> >
> > The BUG_ON() statement suggests that the thread is enqueued as waiter but
> > shouldn't be, since it obtained the lock. From your backtrace:
>
> Right... any idea for investigating why this might be? I assume a
> particular IRQ thread should be unique in the system? Maybe it didn't
> release the lock the last time it ran?

I found an instance where we did not unlock the spinlock before
returning from a function :( I'll test again...

>
> >
> > | Internal error: Oops - undefined instruction: 0 [#1] PREEMPT SMP ARM
> > …
> > | CPU: 0 PID: 1457 Comm: irq/77-ksz8462_ Tainted: G  W  O      4.19.100-rt41 #1
> > …
> > | Process irq/77-ksz8462_ (pid: 1457, stack limit = 0x968e9d88)
> > | [<c0af82f8>] (rt_spin_lock_slowlock_locked) from [<c0af8384>] (rt_spin_lock_slowlock+0x64/0x94)
> > | [<c0af8384>] (rt_spin_lock_slowlock) from [<c0afab28>] (rt_spin_lock+0x7c/0x84)
> > | [<c0afab28>] (rt_spin_lock) from [<bf1c4418>] (ks_irq+0x48/0x540 [ksz8462_h])
> > | [<bf1c4418>] (ks_irq [ksz8462_h]) from [<c01933f0>] (irq_forced_thread_fn+0x30/0xa8)
> >
> > The confusing part is that you use sleeping locks but the banner says
> > PREEMPT instead of PREEMPT_RT.
> > Any chance that you don't have PREEMPT_RT_FULL enabled?
>
> I just checked the .config as well as /proc/version and it seems to be
> enabled... so yes, this is strange - thanks for pointing this out.
>
> # cat /proc/version
> Linux version 4.19.100-rt41 (oe-user@oe-host) (gcc version 8.3.0
> (GCC)) #1 SMP PREEMPT RT Mon Nov 1 15:30:04 UTC 2021
>
> >
> > > Many thanks and regards,
> > >
> > > Andreas
> > >
> > Sebastian