On Wed, Nov 15, 2023 at 12:04:01PM -0800, Jacob Pan wrote:

> we are interleaving cacheline read and xchg. So made it to

Hmm, I wasn't expecting that to be a problem, but sure.

> 	for (i = 0; i < 4; i++) {
> 		pir_copy[i] = pid->pir_l[i];
> 	}
>
> 	for (i = 0; i < 4; i++) {
> 		if (pir_copy[i]) {
> 			pir_copy[i] = arch_xchg(&pid->pir_l[i], 0);
> 			handled = true;
> 		}
> 	}
>
> With DSA MEMFILL test just one queue one MSI, we are saving 3 xchg per loop.
> Here is the performance comparison in IRQ rate:
>
> Original RFC 9.29 m/sec,
> Optimized in your email 8.82m/sec,
> Tweaked above: 9.54m/s
>
> I need to test with more MSI vectors spreading out to all 4 u64. I suspect
> the benefit will decrease since we need to do both read and xchg for
> non-zero entries.

Ah, but performance was not the reason I suggested this. Code
compactness and clarity was.

Possibly using less xchg is just a bonus :-)
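
For anyone who wants to poke at the two-pass loop quoted above outside the
kernel, here is a rough userspace sketch. It is not the patch itself:
struct pid_sketch, harvest_pir() and xchg_count are made-up names for
illustration, and C11 atomic_exchange() merely stands in for arch_xchg().
The counter exists only to show how many exchanges a pass issues.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PIR_WORDS 4	/* four u64 words, as in the quoted loop */

/* Hypothetical stand-in for the descriptor; only the PIR words are modelled. */
struct pid_sketch {
	_Atomic uint64_t pir_l[PIR_WORDS];
};

/* Illustration only: number of exchanges issued so far. */
int xchg_count;

/*
 * Pass 1 takes a plain snapshot of all four words.
 * Pass 2 exchanges only the words whose snapshot was non-zero, so an
 * all-zero word costs a load and nothing more.
 * Returns true if any word was claimed.
 */
bool harvest_pir(struct pid_sketch *pid, uint64_t pir_copy[PIR_WORDS])
{
	bool handled = false;
	int i;

	for (i = 0; i < PIR_WORDS; i++)
		pir_copy[i] = atomic_load_explicit(&pid->pir_l[i],
						   memory_order_relaxed);

	for (i = 0; i < PIR_WORDS; i++) {
		if (pir_copy[i]) {
			/* Re-read and clear atomically; bits set since pass 1 are kept. */
			pir_copy[i] = atomic_exchange(&pid->pir_l[i], 0);
			xchg_count++;
			handled = true;
		}
	}

	return handled;
}
```

With a single bit pending in one word, a call issues one exchange instead of
four, which matches the "saving 3 xchg per loop" case above. A word that
becomes non-zero only after the snapshot is simply left for the next call.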