On Fri, Jan 31, 2014 at 11:17:29AM +0100, Peter Zijlstra wrote: > On Fri, Jan 31, 2014 at 05:03:48AM -0500, George Spelvin wrote: > > How about getting rid of that TICKET_MSB mess and doing something like: > > > > #define TICKET_MASK 0xFFFF > > > > static inline void ticket_spin_unlock(atomic_t *tickets) > > { > > u32 t = *tickets; > > > > smp_mb__before_atomic_inc(); > > > > /* Increment the low 16 bits without affecting the upper. */ > > if (unlikely((~t & TICKET_MASK) == 0)) > > atomic_add(-(atomic_t)TICKET_MASK, tickets); > > else > > atomic_inc(tickets); > > } > > > > That also allows up to 2^16 waiters, up from 2^15. > > (Minus one on both cases, if you want to be fussy.) > > Ah indeed. That'll work. That said, any arch that can single-copy > address shorts can probably do better than this generic atomic_t thing. > > My main point was that we should seriously look at a ticket lock instead > of the MCS one, because while the MCS has better contention behaviour, > we shouldn't optimize locks for the worst contention. We do need to articulate what we -are- optimizing for, or more generally, what we are trying to accomplish with these locking primitives. Here are some of the traditional goals: 1. Avoid performance collapse under heavy load. Here the goal is to have throughput level off rather than decreasing when load exceeds saturation levels. 2. Avoid starvation of requesters. Here the goal is a very minimal level of fairness. 3. Achieve some reasonable level of fairness if the underlying hardware provides fair access to memory. Here "reasonable" might mean that lock-grant probabilities not differ by more than (say) a factor of two. 4. Achieve some reasonable level of fairness even if the underlying hardware is unfair. Rumor has it that some of the very large systems in use today do not guarantee fair access to cache lines under heavy memory contention. 5. Achieve some minimal level of scalability, so that throughput increases monotonically with increasing load. Preferably without penalizing small-system performance. 6. Achieve some reasonable level of scalability, so that adding a given CPU provides at least some fraction (say, 80%) of one CPU's worth of incremental throughput. 7. Achieve linear scalability. Ticket locking has problems with #1 on some systems due to the thundering herd of reads following each invalidation. (Yes, it is not as bad as the atomic-operation thundering herd associated with test-and-set spinlocks, but it can neverthelless be a problem on larger systems.) Achieving #4 requires some sort of queued lock as far as I can see. Although you -might- get #5 with a very clever NUMA-aware lock, #6 and #7 will require some level of redesign. Fancy locks can help if the common code path is fully parallelized, but where some rare error path is not worth the effort. In this case, using a fancy lock on the error path can at least keep the system moving during the time that the error condition is in force, and hopefully permit returning to full throughput once the error condition is gone. So, what exactly are we trying to achieve with all this? ;-) Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-arch" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html