On Mon, Jul 15, 2019 at 05:30:01PM -0400, Waiman Long wrote:
> On 7/15/19 3:25 PM, Alex Kogan wrote:
> >  /*
> > - * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
> > - * size and four of them will fit nicely in one 64-byte cacheline. For
> > - * pvqspinlock, however, we need more space for extra data. To accommodate
> > - * that, we insert two more long words to pad it up to 32 bytes. IOW, only
> > - * two of them can fit in a cacheline in this case. That is OK as it is rare
> > - * to have more than 2 levels of slowpath nesting in actual use. We don't
> > - * want to penalize pvqspinlocks to optimize for a rare case in native
> > - * qspinlocks.
> > + * On 64-bit architectures, the mcs_spinlock structure will be 20 bytes in
> > + * size. For pvqspinlock or the NUMA-aware variant, however, we need more
> > + * space for extra data. To accommodate that, we insert two more long words
> > + * to pad it up to 36 bytes.
> >  */
>
> The 20-byte figure is wrong. It is actually 24 bytes on 64-bit, as the
> mcs_spinlock structure is 8-byte aligned. For better cacheline
> alignment, I would like to keep mcs_spinlock at 16 bytes as before.
> Instead, you can use encode_tail() to store the CNA node pointer in
> "locked". For instance, use (encode_tail() << 1) in locked to
> distinguish it from the regular locked=1 value.

Yes, please don't bloat this. I already don't like what Waiman did for
the paravirt case, but this is horrible.
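
For reference, a minimal sketch of the encoding Waiman suggests above,
assuming the existing encode_tail()/decode_tail() helpers in
kernel/locking/qspinlock.c; the mcs_set_locked_node()/mcs_get_locked_node()
names are hypothetical, for illustration only, and not part of any posted
patch:

/*
 * Sketch only: stash an encoded MCS node reference in the existing
 * 32-bit "locked" field instead of growing struct mcs_spinlock.
 *
 * encode_tail() packs (cpu, idx) into a non-zero u32, so after the
 * left shift any stored value is > 1 and cannot be confused with the
 * regular locked == 1 or the unlocked == 0 states.
 */
static inline void mcs_set_locked_node(struct mcs_spinlock *node,
				       int cpu, int idx)
{
	node->locked = encode_tail(cpu, idx) << 1;	/* always > 1 */
}

static inline struct mcs_spinlock *mcs_get_locked_node(struct mcs_spinlock *node)
{
	u32 val = node->locked;

	if (val <= 1)	/* 0: unlocked, 1: regular locked */
		return NULL;

	return decode_tail(val >> 1);	/* recover the stored node pointer */
}

Under these assumptions, mcs_spinlock stays at 16 bytes, so four nodes
still fit in one 64-byte cacheline; the cost is an encode/decode step on
the path that needs the extra node reference.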