On 2024-10-14 7:59 a.m., Peter Zijlstra wrote:
> On Mon, Sep 23, 2024 at 08:49:17PM +0200, Mingwei Zhang wrote:
>
>> The original implementation is by design having a terrible performance
>> overhead, ie., every PMU context switch at runtime requires a SRCU
>> lock pair and pmu list traversal. To reduce the overhead, we put
>> "passthrough" pmus in the front of the list and quickly exit the pmu
>> traversal when we just pass the last "passthrough" pmu.
>
> What was the expensive bit? The SRCU memory barrier or the list
> iteration? How long is that list really?

Both. But I don't think there is any performance data.

The length of the list varies across platforms. On a modern server,
there can be hundreds of PMUs: uncore PMUs, CXL PMUs, IOMMU PMUs, PMUs
of accelerator devices, and PMUs of all kinds of other devices. The
number will keep growing as more and more devices gain PMU
capabilities.

Two methods were considered.

- One is to add a global variable to track the "passthrough" pmu. The
  idea assumes that there is only one "passthrough" pmu that requires
  the switch, and that this will not change in the near future. Both
  the SRCU memory barrier and the list iteration can then be avoided.
  This is what the patch implements.

- The other is to always put the "passthrough" pmus at the front of
  the list, so the unnecessary list iteration can be avoided. It does
  nothing about the SRCU lock pair. (A rough sketch of this option is
  appended at the end of this mail.)

Thanks,
Kan
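
---

For illustration only, an untested sketch of the second option. It
assumes the "passthrough" pmus are kept at the head of the global
'pmus' list and marked by a hypothetical PERF_PMU_CAP_PASSTHROUGH
capability bit; switch_pmu_context() is just a placeholder for the
real per-pmu save/restore. The SRCU read-side section around the walk
is unchanged, so only the iteration cost is reduced.

static void perf_switch_passthrough_pmus(bool enter_guest)
{
	struct pmu *pmu;
	int idx;

	idx = srcu_read_lock(&pmus_srcu);
	list_for_each_entry_rcu(pmu, &pmus, entry) {
		/*
		 * Passthrough pmus sort first, so the walk can stop at
		 * the first non-passthrough pmu instead of scanning the
		 * whole list.
		 */
		if (!(pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH))
			break;

		/* Placeholder for the actual per-pmu context switch. */
		switch_pmu_context(pmu, enter_guest);
	}
	srcu_read_unlock(&pmus_srcu, idx);
}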