Re: [PATCH 09/15] x86/irq: Install posted MSI notification handler

Jacob Pan <jacob.jun.pan@xxxxxxxxxxxxxxx> · Tue, 2 Apr 2024 19:43:55 -0700

Hi Zeng,

On Fri, 29 Mar 2024 15:32:00 +0800, Zeng Guang <guang.zeng@xxxxxxxxx> wrote:

> On 1/27/2024 7:42 AM, Jacob Pan wrote:
> > @@ -353,6 +360,111 @@ void intel_posted_msi_init(void)
> >   	pid->nv = POSTED_MSI_NOTIFICATION_VECTOR;
> >   	pid->ndst = this_cpu_read(x86_cpu_to_apicid);
> >   }
> > +
> > +/*
> > + * De-multiplexing posted interrupts is on the performance path, the
> > code
> > + * below is written to optimize the cache performance based on the
> > following
> > + * considerations:
> > + * 1.Posted interrupt descriptor (PID) fits in a cache line that is
> > frequently
> > + *   accessed by both CPU and IOMMU.
> > + * 2.During posted MSI processing, the CPU needs to do 64-bit read and
> > xchg
> > + *   for checking and clearing posted interrupt request (PIR), a 256
> > bit field
> > + *   within the PID.
> > + * 3.On the other side, the IOMMU does atomic swaps of the entire PID
> > cache
> > + *   line when posting interrupts and setting control bits.
> > + * 4.The CPU can access the cache line a magnitude faster than the
> > IOMMU.
> > + * 5.Each time the IOMMU does interrupt posting to the PIR will evict
> > the PID
> > + *   cache line. The cache line states after each operation are as
> > follows:
> > + *   CPU		IOMMU			PID Cache line
> > state
> > + *   ---------------------------------------------------------------
> > + *...read64					exclusive
> > + *...lock xchg64				modified
> > + *...			post/atomic swap	invalid
> > + *...-------------------------------------------------------------
> > + *
> > + * To reduce L1 data cache miss, it is important to avoid contention
> > with
> > + * IOMMU's interrupt posting/atomic swap. Therefore, a copy of PIR is
> > used
> > + * to dispatch interrupt handlers.
> > + *
> > + * In addition, the code is trying to keep the cache line state
> > consistent
> > + * as much as possible. e.g. when making a copy and clearing the PIR
> > + * (assuming non-zero PIR bits are present in the entire PIR), it does:
> > + *		read, read, read, read, xchg, xchg, xchg, xchg
> > + * instead of:
> > + *		read, xchg, read, xchg, read, xchg, read, xchg
> > + */
> > +static __always_inline inline bool handle_pending_pir(u64 *pir, struct
> > pt_regs *regs) +{
> > +	int i, vec = FIRST_EXTERNAL_VECTOR;
> > +	unsigned long pir_copy[4];
> > +	bool handled = false;
> > +
> > +	for (i = 0; i < 4; i++)
> > +		pir_copy[i] = pir[i];
> > +
> > +	for (i = 0; i < 4; i++) {
> > +		if (!pir_copy[i])
> > +			continue;
> > +
> > +		pir_copy[i] = arch_xchg(pir, 0);  
> 
> Here is a problem that pir_copy[i] will always be written as pir[0]. 
> This leads to handle spurious posted MSIs later.
Yes, you are right. It should be
pir_copy[i] = arch_xchg(&pir[i], 0);

Will fix in v2, really appreciated.

> > +		handled = true;
> > +	}
> > +
> > +	if (handled) {
> > +		for_each_set_bit_from(vec, pir_copy,
> > FIRST_SYSTEM_VECTOR)
> > +			call_irq_handler(vec, regs);
> > +	}
> > +
> > +	return handled;
> > +}
> > +
> > +/*
> > + * Performance data shows that 3 is good enough to harvest 90+% of the
> > benefit
> > + * on high IRQ rate workload.
> > + */
> > +#define MAX_POSTED_MSI_COALESCING_LOOP 3
> > +
> > +/*
> > + * For MSIs that are delivered as posted interrupts, the CPU
> > notifications
> > + * can be coalesced if the MSIs arrive in high frequency bursts.
> > + */
> > +DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
> > +{
> > +	struct pt_regs *old_regs = set_irq_regs(regs);
> > +	struct pi_desc *pid;
> > +	int i = 0;
> > +
> > +	pid = this_cpu_ptr(&posted_interrupt_desc);
> > +
> > +	inc_irq_stat(posted_msi_notification_count);
> > +	irq_enter();
> > +
> > +	/*
> > +	 * Max coalescing count includes the extra round of
> > handle_pending_pir
> > +	 * after clearing the outstanding notification bit. Hence, at
> > most
> > +	 * MAX_POSTED_MSI_COALESCING_LOOP - 1 loops are executed here.
> > +	 */
> > +	while (++i < MAX_POSTED_MSI_COALESCING_LOOP) {
> > +		if (!handle_pending_pir(pid->pir64, regs))
> > +			break;
> > +	}
> > +
> > +	/*
> > +	 * Clear outstanding notification bit to allow new IRQ
> > notifications,
> > +	 * do this last to maximize the window of interrupt coalescing.
> > +	 */
> > +	pi_clear_on(pid);
> > +
> > +	/*
> > +	 * There could be a race of PI notification and the clearing
> > of ON bit,
> > +	 * process PIR bits one last time such that handling the new
> > interrupts
> > +	 * are not delayed until the next IRQ.
> > +	 */
> > +	handle_pending_pir(pid->pir64, regs);
> > +
> > +	apic_eoi();
> > +	irq_exit();
> > +	set_irq_regs(old_regs);
> >   }
> >   #endif /* X86_POSTED_MSI */
> >     

Thanks,

Jacob