Hi Peter,

On Wed, 15 Nov 2023 13:56:24 +0100, Peter Zijlstra <peterz@xxxxxxxxxxxxx>
wrote:

> On Sat, Nov 11, 2023 at 08:16:39PM -0800, Jacob Pan wrote:
>
> > +static __always_inline inline void handle_pending_pir(struct pi_desc *pid,
> > +						       struct pt_regs *regs)
> > +{
>
> __always_inline means that... (A)
>
> > +	int i, vec = FIRST_EXTERNAL_VECTOR;
> > +	u64 pir_copy[4];
> > +
> > +	/*
> > +	 * Make a copy of PIR which contains IRQ pending bits for vectors,
> > +	 * then invoke IRQ handlers for each pending vector.
> > +	 * If any new interrupts were posted while we are processing, will
> > +	 * do again before allowing new notifications. The idea is to
> > +	 * minimize the number of the expensive notifications if IRQs come
> > +	 * in a high frequency burst.
> > +	 */
> > +	for (i = 0; i < 4; i++)
> > +		pir_copy[i] = raw_atomic64_xchg((atomic64_t *)&pid->pir_l[i], 0);
> > +
> > +	/*
> > +	 * Ideally, we should start from the high order bits set in the PIR
> > +	 * since each bit represents a vector. Higher order bit position means
> > +	 * the vector has higher priority. But external vectors are allocated
> > +	 * based on availability not priority.
> > +	 *
> > +	 * EOI is included in the IRQ handlers call to apic_ack_irq, which
> > +	 * allows higher priority system interrupt to get in between.
> > +	 */
> > +	for_each_set_bit_from(vec, (unsigned long *)&pir_copy[0], 256)
> > +		call_irq_handler(vec, regs);
> > +
> > +}
> > +
> > +/*
> > + * Performance data shows that 3 is good enough to harvest 90+% of the
> > + * benefit on high IRQ rate workload.
> > + * Alternatively, could make this tunable, use 3 as default.
> > + */
> > +#define MAX_POSTED_MSI_COALESCING_LOOP 3
> > +
> > +/*
> > + * For MSIs that are delivered as posted interrupts, the CPU notifications
> > + * can be coalesced if the MSIs arrive in high frequency bursts.
> > + */ > > +DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification) > > +{ > > + struct pt_regs *old_regs = set_irq_regs(regs); > > + struct pi_desc *pid; > > + int i = 0; > > + > > + pid = this_cpu_ptr(&posted_interrupt_desc); > > + > > + inc_irq_stat(posted_msi_notification_count); > > + irq_enter(); > > + > > + while (i++ < MAX_POSTED_MSI_COALESCING_LOOP) { > > + handle_pending_pir(pid, regs); > > + > > + /* > > + * If there are new interrupts posted in PIR, do > > again. If > > + * nothing pending, no need to wait for more > > interrupts. > > + */ > > + if (is_pir_pending(pid)) > > So this reads those same 4 words we xchg in handle_pending_pir(), right? > > > + continue; > > + else > > + break; > > + } > > + > > + /* > > + * Clear outstanding notification bit to allow new IRQ > > notifications, > > + * do this last to maximize the window of interrupt coalescing. > > + */ > > + pi_clear_on(pid); > > + > > + /* > > + * There could be a race of PI notification and the clearing > > of ON bit, > > + * process PIR bits one last time such that handling the new > > interrupts > > + * are not delayed until the next IRQ. > > + */ > > + if (unlikely(is_pir_pending(pid))) > > + handle_pending_pir(pid, regs); > > (A) ... we get _two_ copies of that thing in this function. Does that > make sense ? > > > + > > + apic_eoi(); > > + irq_exit(); > > + set_irq_regs(old_regs); > > +} > > #endif /* X86_POSTED_MSI */ > > Would it not make more sense to write things something like: > it is a great idea, we can save expensive xchg if pir[i] is 0. But I have to tweak a little to let it perform better. > bool handle_pending_pir() > { > bool handled = false; > u64 pir_copy[4]; > > for (i = 0; i < 4; i++) { > if (!pid-pir_l[i]) { > pir_copy[i] = 0; > continue; > } > > pir_copy[i] = arch_xchg(&pir->pir_l[i], 0); we are interleaving cacheline read and xchg. 
So I changed it to:

	for (i = 0; i < 4; i++)
		pir_copy[i] = pid->pir_l[i];

	for (i = 0; i < 4; i++) {
		if (pir_copy[i]) {
			pir_copy[i] = arch_xchg(&pid->pir_l[i], 0);
			handled = true;
		}
	}

With a DSA MEMFILL test using just one queue and one MSI, we save 3 xchg
per loop. Here is the performance comparison in IRQ rate:

	Original RFC:			9.29 M/sec
	Optimized as in your email:	8.82 M/sec
	Tweaked as above:		9.54 M/sec

I need to test with more MSI vectors spread out across all 4 u64s. I
suspect the benefit will decrease, since we then need to do both a read
and an xchg for the non-zero entries.

> 		handled |= true;
> 	}
>
> 	if (!handled)
> 		return handled;
>
> 	for_each_set_bit()
> 		....
>
> 	return handled;
> }
>
> sysvec_posted_blah_blah()
> {
> 	bool done = false;
> 	bool handled;
>
> 	for (;;) {
> 		handled = handle_pending_pir();
> 		if (done)
> 			break;
> 		if (!handled || ++loops > MAX_LOOPS) {
> 			pi_clear_on(pid);
> 			/* once more after clear_on */
> 			done = true;
> 		}
> 	}
> }
>
> Hmm?

Thanks,

Jacob
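
P.S. For anyone following along, the tweak above can be exercised outside
the kernel. The sketch below is a minimal userspace model using C11
atomics; `pir`, `PIR_WORDS`, and `harvest_pir()` are hypothetical
stand-ins for the per-CPU `struct pi_desc` PIR field and the kernel's
`handle_pending_pir()`, and `atomic_exchange()` stands in for
`arch_xchg()`. It only illustrates the read-all-then-xchg-nonzero
ordering, not real interrupt delivery.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define PIR_WORDS 4	/* 256 vectors = 4 x u64, one pending bit each */

/* Hypothetical stand-in for the PIR words in struct pi_desc. Devices
 * (here: the test harness) set bits; the handler harvests them. */
static _Atomic uint64_t pir[PIR_WORDS];

/*
 * Two-pass harvest as in the mail: first a plain read of all four
 * words (a single cacheline scan), then an atomic exchange only for
 * the words that were observed non-zero. Empty words never pay for
 * an xchg. Returns 1 if any vector was pending.
 */
static int harvest_pir(uint64_t pir_copy[PIR_WORDS])
{
	int handled = 0;
	int i;

	for (i = 0; i < PIR_WORDS; i++)
		pir_copy[i] = atomic_load(&pir[i]);

	for (i = 0; i < PIR_WORDS; i++) {
		if (pir_copy[i]) {
			/* Re-fetch atomically: bits may have been
			 * posted since the plain read above. */
			pir_copy[i] = atomic_exchange(&pir[i], 0);
			handled = 1;
		}
	}
	return handled;
}
```

A caller would loop over the set bits of `pir_copy` dispatching one
handler per vector, and repeat while `harvest_pir()` keeps returning 1,
mirroring the coalescing loop in `sysvec_posted_msi_notification`.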