Hi Zeng,

On Fri, 29 Mar 2024 15:32:00 +0800, Zeng Guang <guang.zeng@xxxxxxxxx> wrote:

> On 1/27/2024 7:42 AM, Jacob Pan wrote:
> > @@ -353,6 +360,111 @@ void intel_posted_msi_init(void)
> >  	pid->nv = POSTED_MSI_NOTIFICATION_VECTOR;
> >  	pid->ndst = this_cpu_read(x86_cpu_to_apicid);
> >  }
> > +
> > +/*
> > + * De-multiplexing posted interrupts is on the performance path, the code
> > + * below is written to optimize the cache performance based on the following
> > + * considerations:
> > + * 1.Posted interrupt descriptor (PID) fits in a cache line that is frequently
> > + *   accessed by both CPU and IOMMU.
> > + * 2.During posted MSI processing, the CPU needs to do 64-bit read and xchg
> > + *   for checking and clearing posted interrupt request (PIR), a 256 bit field
> > + *   within the PID.
> > + * 3.On the other side, the IOMMU does atomic swaps of the entire PID cache
> > + *   line when posting interrupts and setting control bits.
> > + * 4.The CPU can access the cache line a magnitude faster than the IOMMU.
> > + * 5.Each time the IOMMU does interrupt posting to the PIR will evict the PID
> > + *   cache line. The cache line states after each operation are as follows:
> > + *   CPU		IOMMU			PID Cache line state
> > + * ---------------------------------------------------------------
> > + *...read64					exclusive
> > + *...lock xchg64				modified
> > + *...			post/atomic swap	invalid
> > + *...-------------------------------------------------------------
> > + *
> > + * To reduce L1 data cache miss, it is important to avoid contention with
> > + * IOMMU's interrupt posting/atomic swap. Therefore, a copy of PIR is used
> > + * to dispatch interrupt handlers.
> > + *
> > + * In addition, the code is trying to keep the cache line state consistent
> > + * as much as possible. e.g. when making a copy and clearing the PIR
> > + * (assuming non-zero PIR bits are present in the entire PIR), it does:
> > + *	read, read, read, read, xchg, xchg, xchg, xchg
> > + * instead of:
> > + *	read, xchg, read, xchg, read, xchg, read, xchg
> > + */
> > +static __always_inline inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
> > +{
> > +	int i, vec = FIRST_EXTERNAL_VECTOR;
> > +	unsigned long pir_copy[4];
> > +	bool handled = false;
> > +
> > +	for (i = 0; i < 4; i++)
> > +		pir_copy[i] = pir[i];
> > +
> > +	for (i = 0; i < 4; i++) {
> > +		if (!pir_copy[i])
> > +			continue;
> > +
> > +		pir_copy[i] = arch_xchg(pir, 0);
>
> Here is a problem that pir_copy[i] will always be written as pir[0].
> This leads to handling spurious posted MSIs later.

Yes, you are right. It should be

	pir_copy[i] = arch_xchg(&pir[i], 0);

Will fix in v2, much appreciated. A corrected sketch of the loop is in
the P.S. at the end of this mail.

> > +		handled = true;
> > +	}
> > +
> > +	if (handled) {
> > +		for_each_set_bit_from(vec, pir_copy, FIRST_SYSTEM_VECTOR)
> > +			call_irq_handler(vec, regs);
> > +	}
> > +
> > +	return handled;
> > +}
> > +
> > +/*
> > + * Performance data shows that 3 is good enough to harvest 90+% of the benefit
> > + * on high IRQ rate workload.
> > + */
> > +#define MAX_POSTED_MSI_COALESCING_LOOP 3
> > +
> > +/*
> > + * For MSIs that are delivered as posted interrupts, the CPU notifications
> > + * can be coalesced if the MSIs arrive in high frequency bursts.
> > + */
> > +DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
> > +{
> > +	struct pt_regs *old_regs = set_irq_regs(regs);
> > +	struct pi_desc *pid;
> > +	int i = 0;
> > +
> > +	pid = this_cpu_ptr(&posted_interrupt_desc);
> > +
> > +	inc_irq_stat(posted_msi_notification_count);
> > +	irq_enter();
> > +
> > +	/*
> > +	 * Max coalescing count includes the extra round of handle_pending_pir
> > +	 * after clearing the outstanding notification bit. Hence, at most
> > +	 * MAX_POSTED_MSI_COALESCING_LOOP - 1 loops are executed here.
> > +	 */
> > +	while (++i < MAX_POSTED_MSI_COALESCING_LOOP) {
> > +		if (!handle_pending_pir(pid->pir64, regs))
> > +			break;
> > +	}
> > +
> > +	/*
> > +	 * Clear outstanding notification bit to allow new IRQ notifications,
> > +	 * do this last to maximize the window of interrupt coalescing.
> > +	 */
> > +	pi_clear_on(pid);
> > +
> > +	/*
> > +	 * There could be a race of PI notification and the clearing of ON bit,
> > +	 * process PIR bits one last time such that handling the new interrupts
> > +	 * are not delayed until the next IRQ.
> > +	 */
> > +	handle_pending_pir(pid->pir64, regs);
> > +
> > +	apic_eoi();
> > +	irq_exit();
> > +	set_irq_regs(old_regs);
> > +}
> >  #endif /* X86_POSTED_MSI */
> >

Thanks,

Jacob
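
P.S. For the archive, below is an untested sketch of how I expect
handle_pending_pir() to look after folding in the fix. The only
functional change versus the quoted patch is the arch_xchg() argument;
the extra comments are just for illustration and may not appear
verbatim in v2.

static __always_inline inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
{
	int i, vec = FIRST_EXTERNAL_VECTOR;
	unsigned long pir_copy[4];
	bool handled = false;

	/* Snapshot the whole PIR first: all reads before any xchg, per the
	 * cache line state comment above.
	 */
	for (i = 0; i < 4; i++)
		pir_copy[i] = pir[i];

	/* Only clear the 64-bit chunks that actually had bits pending */
	for (i = 0; i < 4; i++) {
		if (!pir_copy[i])
			continue;

		/* v2 fix: xchg the i-th chunk, not always pir[0] */
		pir_copy[i] = arch_xchg(&pir[i], 0);
		handled = true;
	}

	if (handled) {
		for_each_set_bit_from(vec, pir_copy, FIRST_SYSTEM_VECTOR)
			call_irq_handler(vec, regs);
	}

	return handled;
}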