On Mon, Oct 14, 2024 at 12:15:11PM -0400, Liang, Kan wrote:
>
>
> On 2024-10-14 7:59 a.m., Peter Zijlstra wrote:
> > On Mon, Sep 23, 2024 at 08:49:17PM +0200, Mingwei Zhang wrote:
> >
> >> The original implementation by design has a terrible performance
> >> overhead, i.e., every PMU context switch at runtime requires an SRCU
> >> lock pair and a pmu list traversal. To reduce the overhead, we put
> >> "passthrough" pmus at the front of the list and exit the pmu
> >> traversal as soon as we pass the last "passthrough" pmu.
> >
> > What was the expensive bit? The SRCU memory barrier or the list
> > iteration? How long is that list really?
>
> Both. But I don't think there is any performance data.
>
> The length of the list can vary across platforms. On a modern server
> there can be hundreds of PMUs: uncore PMUs, CXL PMUs, IOMMU PMUs,
> PMUs of accelerator devices, and PMUs from all kinds of other
> devices. The number could keep increasing as more and more devices
> gain PMU support.
>
> Two methods were considered.
>
> - One is to add a global variable to track the "passthrough" pmu.
>   The idea assumes that there is only one "passthrough" pmu that
>   requires the switch, and that this will not change in the near
>   future. Both the SRCU memory barrier and the list iteration can
>   then be avoided. It's implemented in the patch.
>
> - The other is to always put the "passthrough" pmus at the front of
>   the list, so the unnecessary list iteration can be avoided. It
>   does nothing for the SRCU lock pair.

PaulMck has patches that introduce srcu_read_lock_lite(), which would
avoid the smp_mb() in most cases:

  https://lkml.kernel.org/r/20241011173931.2050422-6-paulmck@xxxxxxxxxx

We can also keep a second list, just for the passthrough pmus. A bit
like sched_cb_list.
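
For illustration, a minimal sketch of the global-variable method
described above, under its single-passthrough-pmu assumption. This is
not the posted patch; all identifiers here (passthrough_pmu,
perf_register_passthrough_pmu, perf_passthrough_switch_one, and the
pmu->switch_guest_state callback) are made up for the sketch:

#include <linux/perf_event.h>

/* Sketch only: assumes at most one passthrough pmu ever registers. */
static struct pmu *passthrough_pmu __read_mostly;

/* Registration side; cmpxchg() rejects a second passthrough pmu. */
static int perf_register_passthrough_pmu(struct pmu *pmu)
{
	if (cmpxchg(&passthrough_pmu, NULL, pmu))
		return -EBUSY;
	return 0;
}

/* Guest entry/exit path: one dereference, no SRCU pair, no list walk. */
static void perf_passthrough_switch_one(bool enter_guest)
{
	struct pmu *pmu = READ_ONCE(passthrough_pmu);

	if (pmu)
		pmu->switch_guest_state(pmu, enter_guest);
}

The obvious cost of this design is the baked-in assumption that
exactly one pmu ever needs the switch; it falls over as soon as a
second passthrough pmu shows up.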
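
For the SRCU side, a rough sketch of what the full-list walk could
look like with the _lite reader from that series, assuming pmus_srcu
could be converted to lite-only readers (not verified here; the lite
flavor has constraints on how the srcu_struct may otherwise be used).
The is_passthrough flag and switch_guest_state() callback are again
hypothetical; pmus and pmus_srcu are the existing perf core symbols:

static void perf_switch_passthrough_pmus(bool enter_guest)
{
	struct pmu *pmu;
	int idx;

	/* _lite reader: no smp_mb() on this fast path. */
	idx = srcu_read_lock_lite(&pmus_srcu);
	list_for_each_entry_rcu(pmu, &pmus, entry,
				srcu_read_lock_held(&pmus_srcu)) {
		if (!pmu->is_passthrough)	/* hypothetical flag */
			continue;
		pmu->switch_guest_state(pmu, enter_guest);
	}
	srcu_read_unlock_lite(&pmus_srcu, idx);
}

This removes the barrier but still walks the whole list, so it only
addresses half of the complaint above.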
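
And a sketch of the second-list idea, analogous to sched_cb_list: the
switch path walks only a short dedicated list instead of every
registered pmu. The passthrough_entry member and the callback are
invented for the sketch; this keeps the SRCU lock pair as-is, but it
could be combined with the _lite reader above:

static LIST_HEAD(passthrough_pmus);

/* Registration side, e.g. under pmus_lock next to the main pmus list. */
static void perf_pmu_passthrough_add(struct pmu *pmu)
{
	list_add_tail_rcu(&pmu->passthrough_entry, &passthrough_pmus);
}

/* Switch path: walk only the (short) passthrough list. */
static void perf_switch_passthrough_list(bool enter_guest)
{
	struct pmu *pmu;
	int idx;

	idx = srcu_read_lock(&pmus_srcu);
	list_for_each_entry_rcu(pmu, &passthrough_pmus, passthrough_entry,
				srcu_read_lock_held(&pmus_srcu))
		pmu->switch_guest_state(pmu, enter_guest);
	srcu_read_unlock(&pmus_srcu, idx);
}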