Hi Peter,

On Wed, 15 Nov 2023 13:56:24 +0100, Peter Zijlstra <peterz@xxxxxxxxxxxxx>
wrote:

> On Sat, Nov 11, 2023 at 08:16:39PM -0800, Jacob Pan wrote:
>
> > +static __always_inline inline void handle_pending_pir(struct pi_desc *pid,
> > +						       struct pt_regs *regs)
> > +{
>
> __always_inline means that... (A)
>
> > +	int i, vec = FIRST_EXTERNAL_VECTOR;
> > +	u64 pir_copy[4];
> > +
> > +	/*
> > +	 * Make a copy of PIR which contains IRQ pending bits for vectors,
> > +	 * then invoke IRQ handlers for each pending vector.
> > +	 * If any new interrupts were posted while we are processing, will
> > +	 * do again before allowing new notifications. The idea is to
> > +	 * minimize the number of the expensive notifications if IRQs come
> > +	 * in a high frequency burst.
> > +	 */
> > +	for (i = 0; i < 4; i++)
> > +		pir_copy[i] = raw_atomic64_xchg((atomic64_t *)&pid->pir_l[i], 0);
> > +
> > +	/*
> > +	 * Ideally, we should start from the high order bits set in the PIR
> > +	 * since each bit represents a vector. Higher order bit position means
> > +	 * the vector has higher priority. But external vectors are allocated
> > +	 * based on availability not priority.
> > +	 *
> > +	 * EOI is included in the IRQ handlers call to apic_ack_irq, which
> > +	 * allows higher priority system interrupt to get in between.
> > +	 */
> > +	for_each_set_bit_from(vec, (unsigned long *)&pir_copy[0], 256)
> > +		call_irq_handler(vec, regs);
> > +
> > +}
> > +
> > +/*
> > + * Performance data shows that 3 is good enough to harvest 90+% of the
> > + * benefit on high IRQ rate workload.
> > + * Alternatively, could make this tunable, use 3 as default.
> > + */
> > +#define MAX_POSTED_MSI_COALESCING_LOOP 3
> > +
> > +/*
> > + * For MSIs that are delivered as posted interrupts, the CPU notifications
> > + * can be coalesced if the MSIs arrive in high frequency bursts.
> > + */ > > +DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification) > > +{ > > + struct pt_regs *old_regs = set_irq_regs(regs); > > + struct pi_desc *pid; > > + int i = 0; > > + > > + pid = this_cpu_ptr(&posted_interrupt_desc); > > + > > + inc_irq_stat(posted_msi_notification_count); > > + irq_enter(); > > + > > + while (i++ < MAX_POSTED_MSI_COALESCING_LOOP) { > > + handle_pending_pir(pid, regs); > > + > > + /* > > + * If there are new interrupts posted in PIR, do > > again. If > > + * nothing pending, no need to wait for more > > interrupts. > > + */ > > + if (is_pir_pending(pid)) > > So this reads those same 4 words we xchg in handle_pending_pir(), right? > > > + continue; > > + else > > + break; > > + } > > + > > + /* > > + * Clear outstanding notification bit to allow new IRQ > > notifications, > > + * do this last to maximize the window of interrupt coalescing. > > + */ > > + pi_clear_on(pid); > > + > > + /* > > + * There could be a race of PI notification and the clearing > > of ON bit, > > + * process PIR bits one last time such that handling the new > > interrupts > > + * are not delayed until the next IRQ. > > + */ > > + if (unlikely(is_pir_pending(pid))) > > + handle_pending_pir(pid, regs); > > (A) ... we get _two_ copies of that thing in this function. Does that > make sense ? > > > + > > + apic_eoi(); > > + irq_exit(); > > + set_irq_regs(old_regs); > > +} > > #endif /* X86_POSTED_MSI */ > > Would it not make more sense to write things something like: > it is a great idea, we can save expensive xchg if pir[i] is 0. But I have to tweak a little to let it perform better. > bool handle_pending_pir() > { > bool handled = false; > u64 pir_copy[4]; > > for (i = 0; i < 4; i++) { > if (!pid-pir_l[i]) { > pir_copy[i] = 0; > continue; > } > > pir_copy[i] = arch_xchg(&pir->pir_l[i], 0); we are interleaving cacheline read and xchg. 
So I changed it to:

	for (i = 0; i < 4; i++)
		pir_copy[i] = pid->pir_l[i];

	for (i = 0; i < 4; i++) {
		if (pir_copy[i]) {
			pir_copy[i] = arch_xchg(&pid->pir_l[i], 0);
			handled = true;
		}
	}

With a DSA MEMFILL test using just one queue and one MSI, we save 3 xchg
per loop. Here is the performance comparison in IRQ rate:

	Original RFC:			9.29 M/sec
	Optimized as in your email:	8.82 M/sec
	Tweaked as above:		9.54 M/sec

I need to test with more MSI vectors spread out across all 4 u64s. I
suspect the benefit will decrease, since we then need to do both a read
and an xchg for the non-zero entries.

> 		handled |= true;
> 	}
>
> 	if (!handled)
> 		return handled;
>
> 	for_each_set_bit()
> 		....
>
> 	return handled;
> }
>
> sysvec_posted_blah_blah()
> {
> 	bool done = false;
> 	bool handled;
>
> 	for (;;) {
> 		handled = handle_pending_pir();
> 		if (done)
> 			break;
> 		if (!handled || ++loops > MAX_LOOPS) {
> 			pi_clear_on(pid);
> 			/* once more after clear_on */
> 			done = true;
> 		}
> 	}
> }
>
> Hmm?

Thanks,

Jacob
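
P.S. For anyone following along, the tweak above can be exercised outside
the kernel. The sketch below is a minimal userspace model using C11
atomics; `pir`, `PIR_WORDS`, and `harvest_pir()` are hypothetical
stand-ins for the per-CPU `struct pi_desc` PIR field and the kernel's
`handle_pending_pir()`, and `atomic_exchange()` stands in for
`arch_xchg()`. It only illustrates the read-all-then-xchg-nonzero
ordering, not real interrupt delivery.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define PIR_WORDS 4	/* 256 vectors = 4 x u64, one pending bit each */

/* Hypothetical stand-in for the PIR words in struct pi_desc. Devices
 * (here: the test harness) set bits; the handler harvests them. */
static _Atomic uint64_t pir[PIR_WORDS];

/*
 * Two-pass harvest as in the mail: first a plain read of all four
 * words (a single cacheline scan), then an atomic exchange only for
 * the words that were observed non-zero. Empty words never pay for
 * an xchg. Returns 1 if any vector was pending.
 */
static int harvest_pir(uint64_t pir_copy[PIR_WORDS])
{
	int handled = 0;
	int i;

	for (i = 0; i < PIR_WORDS; i++)
		pir_copy[i] = atomic_load(&pir[i]);

	for (i = 0; i < PIR_WORDS; i++) {
		if (pir_copy[i]) {
			/* Re-fetch atomically: bits may have been
			 * posted since the plain read above. */
			pir_copy[i] = atomic_exchange(&pir[i], 0);
			handled = 1;
		}
	}
	return handled;
}
```

A caller would loop over the set bits of `pir_copy` dispatching one
handler per vector, and repeat while `harvest_pir()` keeps returning 1,
mirroring the coalescing loop in `sysvec_posted_msi_notification`.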