> Hedi Berriche sent me a simple test case that can
> trigger the failure on the siglock.

Can you post the test case please. How long does it typically
take to reproduce the problem?

> Next, CPU 5 releases the spinlock with st2.rel, changing the lock
> value to 0x0 (correct).
>
> SO FAR SO GOOD.
>
> Now, CPU 4, CPU 5 and CPU 7 all want to acquire the lock again.
> Interestingly, CPU 5 and CPU 7 are both granted the same ticket,

What is the duplicate ticket number that CPUs 5 & 7 get at this point?
Presumably 0x0, yes? Or do they see a stale 0x7fff?

> and the spinlock value (as seen from the debug fault handler) is
> 0x0 after single-stepping over the fetchadd4.acq, in both cases.
> CPU 4 correctly sets the spinlock value to 0x1.

Is the fault handler using "ld.acq" to look at the spinlock value?
If not, then this might be a red herring. [Though clearly something
bad is going on here].

> Any ideas?

What cpu model are you running on?

What is the topological connection between CPU 4, 5 and 7 - are any
of them hyper-threaded siblings? Cores on same socket? N.B. topology
may change from boot to boot, so you may need to capture /proc/cpuinfo
from the same boot where this problem is detected. But the variation
is usually limited to which socket gets to own logical cpu 0.

If this is a memory ordering problem (and that seems quite plausible)
then a liberal sprinkling of "ia64_mf()" calls throughout the spinlock
routines would probably make it go away.

-Tony
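For illustration only, here is a minimal user-space sketch of the ticket
lock pattern discussed in this thread, written with C11 atomics. It is not
the ia64 kernel implementation: the two-counter layout, the struct and
function names, and the placement of the fences are all assumptions made
for the example. The atomic fetch-add stands in for fetchadd4.acq, the
release store stands in for st2.rel, and the sequentially consistent
fences mark the kind of places where an "ia64_mf()" might be sprinkled
when testing the memory ordering theory.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical two-counter ticket lock; NOT the ia64 kernel's packed layout. */
struct ticket_lock {
    atomic_uint next;     /* next ticket to hand out */
    atomic_uint serving;  /* ticket currently allowed to hold the lock */
};

/* Take a ticket, then spin until it is being served. Returns the ticket. */
static unsigned ticket_lock_acquire(struct ticket_lock *l)
{
    /* Atomic fetch-add: every caller must get a distinct ticket value. */
    unsigned ticket = atomic_fetch_add_explicit(&l->next, 1,
                                                memory_order_acquire);

    /* Full fence (assumed placement), in the role of an extra ia64_mf(). */
    atomic_thread_fence(memory_order_seq_cst);

    while (atomic_load_explicit(&l->serving, memory_order_acquire) != ticket)
        ; /* spin until our ticket comes up */

    return ticket;
}

/* Hand the lock to the holder of the next ticket. */
static void ticket_lock_release(struct ticket_lock *l)
{
    unsigned cur = atomic_load_explicit(&l->serving, memory_order_relaxed);

    /* Sanity check: serving == next would mean nobody holds the lock. */
    assert(cur != atomic_load_explicit(&l->next, memory_order_relaxed));

    /* Full fence (assumed placement) before publishing the release. */
    atomic_thread_fence(memory_order_seq_cst);

    atomic_store_explicit(&l->serving, cur + 1, memory_order_release);
}
```

In this sketch, two CPUs being granted the same ticket would correspond to
two callers getting the same return value from ticket_lock_acquire(), which
an atomic fetch-add should never allow. Logging the returned ticket per CPU
is one way to answer the "which duplicate ticket number" question above.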