Hello, Ben.

On Wed, Jun 05, 2013 at 09:59:00AM -0700, Ben Greear wrote:
> One pattern I notice repeating for at least most of the hangs is that all but one
> CPU thread has irqs disabled and is in state 2.  But, there will be one thread
> in state 1 that still has IRQs enabled and it is reported to be in soft-lockup
> instead of hard-lockup.  In 'sysrq l' it always shows some IRQ processing,
> but typically that of the sysrq itself.  I added a printk that would always
> print if the thread notices that smdata->state != curstate, and the soft-lockup
> thread (cpu 2 below) never shows that message.

It sounds like one of the cpus gets live-locked by IRQs.  I can't tell
why the situation is made worse by the other CPUs being tied up.  Do you
ever see CPUs being live-locked by IRQs during normal operation?

> I thought it might be because it was reading stale smdata->state, so I changed
> that to atomic_t hoping that would mitigate that.  I also tried adding smp_rmb()
> below the cpu_relax().  Neither had any effect, so I am left assuming that the

I looked at the code again and the memory accesses seem properly
interlocked.  It's a bit tricky and should probably have used a spinlock
instead, considering it's already a hugely expensive path anyway, but it
does seem correct to me.

> thread instead is stuck handling IRQs and never gets out of the IRQ handler.

Seems that way to me too.

> Maybe since I have 2 real cores, and 3 processes busy-spinning on their CPU cores,
> the remaining process can just never handle all the IRQs and get back to the
> cpu shutdown state machine?  The various soft-hang stacks below show at least slightly
> different stacks, so I assume that thread is doing at least something.

What's the source of all those IRQs tho?  I don't think the IRQs are
from actual events.  The system is quiesced.  Even if it's from receiving
packets, it's gonna quiet down pretty quickly.  The hang doesn't go away
if you disconnect the network cable while hung, right?

What could be happening is that IRQ handling is delegated to a thread
but the IRQ handler itself doesn't clear the IRQ properly and depends on
the handling thread to clear the condition.  If no CPU is available to
schedule the handling thread, the IRQ may be raised and re-raised for
the same condition without ever being handled.  If that's the case, such
a lockup could happen on a normally functioning UP machine, or if the
IRQ is pinned to a single CPU which happens to be running the handling
thread.  At any rate, it'd be a plain live-lock bug on the driver side.

Can you please try to confirm the specific interrupt being continuously
raised?  Detecting the hang shouldn't be too difficult.  Just record the
starting jiffies, and if no progress has been made for, say, ten seconds,
set a flag and print the IRQs being handled while the flag is set.  If it
indeed is the ath device, we probably wanna get the driver maintainer
involved.

Thanks.

-- 
tejun
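
A minimal sketch of the kind of hang detection described above, assuming it is
wired into the stop_machine busy-wait loop and the per-CPU IRQ entry path; the
names (stall_check, stall_progress, stall_log_irq) and the ten-second threshold
are invented for illustration and are not code from this thread, while jiffies,
time_after(), HZ, pr_err() and smp_processor_id() are standard kernel
interfaces:

	#include <linux/jiffies.h>	/* jiffies, time_after(), HZ */
	#include <linux/printk.h>	/* pr_err() */
	#include <linux/smp.h>		/* smp_processor_id() */
	#include <linux/types.h>	/* bool */

	/* Hypothetical instrumentation, not a patch from this thread:
	 * remember when the stop_machine wait started and, once it has
	 * stalled for ~10 seconds, log which IRQs this CPU is still
	 * handling.
	 */
	static unsigned long stall_start;	/* jiffies when the wait began */
	static bool stall_logging;		/* set after ~10s without progress */

	/* call from the stop_machine busy-wait loop, next to cpu_relax() */
	static void stall_check(void)
	{
		if (!stall_start)
			stall_start = jiffies;
		else if (!stall_logging &&
			 time_after(jiffies, stall_start + 10 * HZ))
			stall_logging = true;
	}

	/* call whenever smdata->state advances, so only a real stall trips it */
	static void stall_progress(void)
	{
		stall_start = 0;
		stall_logging = false;
	}

	/* call from the IRQ entry path on the spinning CPU */
	static void stall_log_irq(unsigned int irq)
	{
		if (stall_logging)
			pr_err("CPU%d still handling IRQ %u during stop_machine stall\n",
			       smp_processor_id(), irq);
	}

If one IRQ number keeps repeating in that output while the machine is hung, it
should identify the device whose handler is re-raising the interrupt.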