Hello, Ben.

On Wed, Jun 05, 2013 at 09:59:00AM -0700, Ben Greear wrote:
> One pattern I notice repeating for at least most of the hangs is that all but one
> CPU thread has irqs disabled and is in state 2.  But, there will be one thread
> in state 1 that still has IRQs enabled and it is reported to be in soft-lockup
> instead of hard-lockup.  In 'sysrq l' it always shows some IRQ processing,
> but typically that of the sysrq itself.  I added a printk that would always
> print if the thread notices that smdata->state != curstate, and the soft-lockup
> thread (cpu 2 below) never shows that message.

It sounds like one of the cpus gets live-locked by IRQs.  I can't tell
why the situation is made worse by the other CPUs being tied up.  Do you
ever see CPUs being live-locked by IRQs during normal operation?

> I thought it might be because it was reading stale smdata->state, so I changed
> that to atomic_t hoping that would mitigate that.  I also tried adding smp_rmb()
> below the cpu_relax().  Neither had any effect, so I am left assuming that the

I looked at the code again and the memory accesses seem properly
interlocked.  It's a bit tricky and should probably have used a spinlock
instead, considering it's already a hugely expensive path anyway, but it
does seem correct to me.

> thread instead is stuck handling IRQs and never gets out of the IRQ handler.

Seems that way to me too.

> Maybe since I have 2 real cores, and 3 processes busy-spinning on their CPU cores,
> the remaining process can just never handle all the IRQs and get back to the
> cpu shutdown state machine?  The various soft-hang stacks below show at least slightly
> different stacks, so I assume that thread is doing at least something.

What's the source of all those IRQs tho?  I don't think the IRQs are
from actual events.  The system is quiesced.  Even if it's from receiving
packets, it's gonna quiet down pretty quickly.  The hang doesn't go away
if you disconnect the network cable while hung, right?

What could be happening is that IRQ handling is delegated to a thread
but the IRQ handler itself doesn't clear the IRQ properly and depends on
the handling thread to clear the condition.  If no CPU is available to
schedule the handling thread, the IRQ may be raised and re-raised for
the same condition without ever being handled.  If that's the case, such
a lockup could happen on a normally functioning UP machine, or if the
IRQ is pinned to a single CPU which happens to be running the handling
thread.  At any rate, it'd be a plain live-lock bug on the driver side.

Can you please try to confirm the specific interrupt being continuously
raised?  Detecting the hang shouldn't be too difficult.  Just record the
starting jiffies, and if no progress has been made for, say, ten seconds,
set a flag and print the IRQs being handled while the flag is set.  If it
indeed is the ath device, we probably wanna get the driver maintainer
involved.

Thanks.

-- 
tejun
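
A minimal sketch of the kind of hang detection described above, assuming it is
wired into the stop_machine busy-wait loop and the per-CPU IRQ entry path; the
names (stall_check, stall_progress, stall_log_irq) and the ten-second threshold
are invented for illustration and are not code from this thread, while jiffies,
time_after(), HZ, pr_err() and smp_processor_id() are standard kernel
interfaces:

	#include <linux/jiffies.h>	/* jiffies, time_after(), HZ */
	#include <linux/printk.h>	/* pr_err() */
	#include <linux/smp.h>		/* smp_processor_id() */
	#include <linux/types.h>	/* bool */

	/* Hypothetical instrumentation, not a patch from this thread:
	 * remember when the stop_machine wait started and, once it has
	 * stalled for ~10 seconds, log which IRQs this CPU is still
	 * handling.
	 */
	static unsigned long stall_start;	/* jiffies when the wait began */
	static bool stall_logging;		/* set after ~10s without progress */

	/* call from the stop_machine busy-wait loop, next to cpu_relax() */
	static void stall_check(void)
	{
		if (!stall_start)
			stall_start = jiffies;
		else if (!stall_logging &&
			 time_after(jiffies, stall_start + 10 * HZ))
			stall_logging = true;
	}

	/* call whenever smdata->state advances, so only a real stall trips it */
	static void stall_progress(void)
	{
		stall_start = 0;
		stall_logging = false;
	}

	/* call from the IRQ entry path on the spinning CPU */
	static void stall_log_irq(unsigned int irq)
	{
		if (stall_logging)
			pr_err("CPU%d still handling IRQ %u during stop_machine stall\n",
			       smp_processor_id(), irq);
	}

If one IRQ number keeps repeating in that output while the machine is hung, it
should identify the device whose handler is re-raising the interrupt.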