On Tue, Nov 10, 2020 at 7:37 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Nov 10, 2020 at 04:12:57PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 09, 2020 at 10:12:37AM +0800, Like Xu wrote:
> > > The Precise Event Based Sampling (PEBS) supported on Intel Ice Lake server
> > > platforms can provide an architectural state of the instruction executed
> > > after the instruction that caused the event. This patch set enables the
> > > PEBS via DS feature for KVM (also non) Linux guest on the Ice Lake.
> > > The Linux guest can use PEBS feature like native:
> > >
> > > # perf record -e instructions:ppp ./br_instr a
> > > # perf record -c 100000 -e instructions:pp ./br_instr a
> > >
> > > If the counter_freezing is not enabled on the host, the guest PEBS will
> > > be disabled on purpose when host is using PEBS facility. By default,
> > > KVM disables the co-existence of guest PEBS and host PEBS.
> >
> > Uuhh, what?!? counter_freezing should never be enabled, it's broken. Let
> > me go delete all that code.
>
> ---
> Subject: perf/intel: Remove Perfmon-v4 counter_freezing support
>
> Perfmon-v4 counter freezing is fundamentally broken; remove this
> default-disabled code to make sure nobody uses it.
>
> The feature is called Freeze-on-PMI in the SDM, and if it would do that,
> there wouldn't actually be a problem, *however* it does something subtly
> different. It globally disables the whole PMU when it raises the PMI,
> not when the PMI hits.
>
> This means there's a window between the PMI getting raised and the PMI
> actually getting served where we lose events, and this violates
> perf counter independence. That is, a counting event should not result
> in a different event count when there is a sampling event co-scheduled.
>
What is implemented is Freeze-on-Overflow, yet it is described as
Freeze-on-PMI. That, in itself, is a problem, and I agree with you on
that point. However, there are use cases for both modes.

I can sample on event A and count on B and C, and when A overflows, I
want to snapshot B and C. For that I want B and C at the moment of the
overflow, not at the moment the PMI is delivered. Thus, you would want
the Freeze-on-Overflow behavior. You can collect in this mode with the
perf tool, IIRC:

  perf record -e '{cycles,instructions,branches:S}' ....

(see the perf_event_open() sketch after the quoted patch below)

The other usage model is that of the replay debugger (rr), which you
are alluding to, which needs a precise count of an event, including
during the skid window. For that, you need Freeze-on-PMI (i.e., freeze
when the PMI is delivered). Note that this tool likely only cares about
user-level occurrences of events.

As for counter independence, I am not sure it holds in all cases. If
the events are set up for user+kernel, then as soon as you co-schedule
a sampling event, you will likely get more counts on the counting event
due to the additional kernel entries/exits caused by interrupt-based
profiling. Even if you were to restrict the events to user level only,
I would expect to see a few more counts.

> This is known to break existing software.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> ---
>  arch/x86/events/intel/core.c | 152 -------------------------------------------
>  arch/x86/events/perf_event.h |   3 +-
>  2 files changed, 1 insertion(+), 154 deletions(-)
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index c79748f6921d..9909dfa6fb12 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2121,18 +2121,6 @@ static void intel_tfa_pmu_enable_all(int added)
>  	intel_pmu_enable_all(added);
>  }
>
> -static void enable_counter_freeze(void)
> -{
> -	update_debugctlmsr(get_debugctlmsr() |
> -			DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI);
> -}
> -
> -static void disable_counter_freeze(void)
> -{
> -	update_debugctlmsr(get_debugctlmsr() &
> -			~DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI);
> -}
> -
>  static inline u64 intel_pmu_get_status(void)
>  {
>  	u64 status;
> @@ -2696,95 +2684,6 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
>  	return handled;
>  }
>
> -static bool disable_counter_freezing = true;
> -static int __init intel_perf_counter_freezing_setup(char *s)
> -{
> -	bool res;
> -
> -	if (kstrtobool(s, &res))
> -		return -EINVAL;
> -
> -	disable_counter_freezing = !res;
> -	return 1;
> -}
> -__setup("perf_v4_pmi=", intel_perf_counter_freezing_setup);
> -
> -/*
> - * Simplified handler for Arch Perfmon v4:
> - * - We rely on counter freezing/unfreezing to enable/disable the PMU.
> - *   This is done automatically on PMU ack.
> - * - Ack the PMU only after the APIC.
> - */
> -
> -static int intel_pmu_handle_irq_v4(struct pt_regs *regs)
> -{
> -	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> -	int handled = 0;
> -	bool bts = false;
> -	u64 status;
> -	int pmu_enabled = cpuc->enabled;
> -	int loops = 0;
> -
> -	/* PMU has been disabled because of counter freezing */
> -	cpuc->enabled = 0;
> -	if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask)) {
> -		bts = true;
> -		intel_bts_disable_local();
> -		handled = intel_pmu_drain_bts_buffer();
> -		handled += intel_bts_interrupt();
> -	}
> -	status = intel_pmu_get_status();
> -	if (!status)
> -		goto done;
> -again:
> -	intel_pmu_lbr_read();
> -	if (++loops > 100) {
> -		static bool warned;
> -
> -		if (!warned) {
> -			WARN(1, "perfevents: irq loop stuck!\n");
> -			perf_event_print_debug();
> -			warned = true;
> -		}
> -		intel_pmu_reset();
> -		goto done;
> -	}
> -
> -
> -	handled += handle_pmi_common(regs, status);
> -done:
> -	/* Ack the PMI in the APIC */
> -	apic_write(APIC_LVTPC, APIC_DM_NMI);
> -
> -	/*
> -	 * The counters start counting immediately while ack the status.
> -	 * Make it as close as possible to IRET. This avoids bogus
> -	 * freezing on Skylake CPUs.
> -	 */
> -	if (status) {
> -		intel_pmu_ack_status(status);
> -	} else {
> -		/*
> -		 * CPU may issues two PMIs very close to each other.
> -		 * When the PMI handler services the first one, the
> -		 * GLOBAL_STATUS is already updated to reflect both.
> -		 * When it IRETs, the second PMI is immediately
> -		 * handled and it sees clear status. At the meantime,
> -		 * there may be a third PMI, because the freezing bit
> -		 * isn't set since the ack in first PMI handlers.
> -		 * Double check if there is more work to be done.
> -		 */
> -		status = intel_pmu_get_status();
> -		if (status)
> -			goto again;
> -	}
> -
> -	if (bts)
> -		intel_bts_enable_local();
> -	cpuc->enabled = pmu_enabled;
> -	return handled;
> -}
> -
>  /*
>   * This handler is triggered by the local APIC, so the APIC IRQ handling
>   * rules apply:
> @@ -4081,9 +3980,6 @@ static void intel_pmu_cpu_starting(int cpu)
>  	if (x86_pmu.version > 1)
>  		flip_smm_bit(&x86_pmu.attr_freeze_on_smi);
>
> -	if (x86_pmu.counter_freezing)
> -		enable_counter_freeze();
> -
>  	/* Disable perf metrics if any added CPU doesn't support it. */
>  	if (x86_pmu.intel_cap.perf_metrics) {
>  		union perf_capabilities perf_cap;
> @@ -4154,9 +4050,6 @@ static void free_excl_cntrs(struct cpu_hw_events *cpuc)
>  static void intel_pmu_cpu_dying(int cpu)
>  {
>  	fini_debug_store_on_cpu(cpu);
> -
> -	if (x86_pmu.counter_freezing)
> -		disable_counter_freeze();
>  }
>
>  void intel_cpuc_finish(struct cpu_hw_events *cpuc)
> @@ -4548,39 +4441,6 @@ static __init void intel_nehalem_quirk(void)
>  	}
>  }
>
> -static const struct x86_cpu_desc counter_freezing_ucodes[] = {
> -	INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT, 2, 0x0000000e),
> -	INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT, 9, 0x0000002e),
> -	INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT, 10, 0x00000008),
> -	INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT_D, 1, 0x00000028),
> -	INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT_PLUS, 1, 0x00000028),
> -	INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT_PLUS, 8, 0x00000006),
> -	{}
> -};
> -
> -static bool intel_counter_freezing_broken(void)
> -{
> -	return !x86_cpu_has_min_microcode_rev(counter_freezing_ucodes);
> -}
> -
> -static __init void intel_counter_freezing_quirk(void)
> -{
> -	/* Check if it's already disabled */
> -	if (disable_counter_freezing)
> -		return;
> -
> -	/*
> -	 * If the system starts with the wrong ucode, leave the
> -	 * counter-freezing feature permanently disabled.
> -	 */
> -	if (intel_counter_freezing_broken()) {
> -		pr_info("PMU counter freezing disabled due to CPU errata,"
> -			"please upgrade microcode\n");
> -		x86_pmu.counter_freezing = false;
> -		x86_pmu.handle_irq = intel_pmu_handle_irq;
> -	}
> -}
> -
>  /*
>   * enable software workaround for errata:
>   * SNB: BJ122
> @@ -4966,9 +4826,6 @@ __init int intel_pmu_init(void)
>  			max((int)edx.split.num_counters_fixed, assume);
>  	}
>
> -	if (version >= 4)
> -		x86_pmu.counter_freezing = !disable_counter_freezing;
> -
>  	if (boot_cpu_has(X86_FEATURE_PDCM)) {
>  		u64 capabilities;
>
> @@ -5090,7 +4947,6 @@ __init int intel_pmu_init(void)
>
>  	case INTEL_FAM6_ATOM_GOLDMONT:
>  	case INTEL_FAM6_ATOM_GOLDMONT_D:
> -		x86_add_quirk(intel_counter_freezing_quirk);
>  		memcpy(hw_cache_event_ids, glm_hw_cache_event_ids,
>  		       sizeof(hw_cache_event_ids));
>  		memcpy(hw_cache_extra_regs, glm_hw_cache_extra_regs,
> @@ -5117,7 +4973,6 @@ __init int intel_pmu_init(void)
>  		break;
>
>  	case INTEL_FAM6_ATOM_GOLDMONT_PLUS:
> -		x86_add_quirk(intel_counter_freezing_quirk);
>  		memcpy(hw_cache_event_ids, glp_hw_cache_event_ids,
>  		       sizeof(hw_cache_event_ids));
>  		memcpy(hw_cache_extra_regs, glp_hw_cache_extra_regs,
> @@ -5577,13 +5432,6 @@ __init int intel_pmu_init(void)
>  		pr_cont("full-width counters, ");
>  	}
>
> -	/*
> -	 * For arch perfmon 4 use counter freezing to avoid
> -	 * several MSR accesses in the PMI.
> -	 */
> -	if (x86_pmu.counter_freezing)
> -		x86_pmu.handle_irq = intel_pmu_handle_irq_v4;
> -
>  	if (x86_pmu.intel_cap.perf_metrics)
>  		x86_pmu.intel_ctrl |= 1ULL << GLOBAL_CTRL_EN_PERF_METRICS;
>
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index 10032f023fcc..4084fa31cc21 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -681,8 +681,7 @@ struct x86_pmu {
>
>  	/* PMI handler bits */
>  	unsigned int	late_ack		:1,
> -			enabled_ack		:1,
> -			counter_freezing	:1;
> +			enabled_ack		:1;
>  	/*
>  	 * sysfs attrs
>  	 */
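
For reference, below is a minimal perf_event_open() sketch of the
group-snapshot usage model mentioned above: event A (cycles) is the
sampling group leader, and events B and C (instructions, branches) are
counting members whose values are captured together with each of A's
overflow samples via PERF_SAMPLE_READ/PERF_FORMAT_GROUP. This is only an
illustrative sketch, not part of the patch: the event selection, the
100000 sample period, the open_event() helper and the busy-loop
"workload" are placeholders, error handling is minimal, and the mmap
ring-buffer parsing that a real tool such as perf record performs to
read each per-overflow snapshot is omitted.

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
			   int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Open one hardware event; the group leader samples, the members only count. */
static int open_event(uint64_t config, int group_fd)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_HARDWARE;
	attr.size = sizeof(attr);
	attr.config = config;
	attr.exclude_kernel = 1;
	attr.read_format = PERF_FORMAT_GROUP;
	if (group_fd == -1) {
		/*
		 * Leader: start disabled, overflow every 100000 occurrences,
		 * and snapshot the whole group into each sample record.
		 */
		attr.disabled = 1;
		attr.sample_period = 100000;
		attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_READ;
	}
	return perf_event_open(&attr, 0, -1, group_fd, 0);
}

int main(void)
{
	volatile unsigned long sink = 0;
	uint64_t buf[8];
	int leader, i;

	leader = open_event(PERF_COUNT_HW_CPU_CYCLES, -1);	/* event A */
	if (leader < 0) {
		perror("perf_event_open");
		return 1;
	}
	open_event(PERF_COUNT_HW_INSTRUCTIONS, leader);		/* event B */
	open_event(PERF_COUNT_HW_BRANCH_INSTRUCTIONS, leader);	/* event C */

	ioctl(leader, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
	ioctl(leader, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);

	for (i = 0; i < 50000000; i++)		/* placeholder workload */
		sink += i;

	ioctl(leader, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

	/*
	 * Final totals only; the per-overflow snapshots are delivered in the
	 * sample records of the leader's mmap buffer (not parsed here).
	 */
	if (read(leader, buf, sizeof(buf)) > 0)
		for (i = 0; i < (int)buf[0]; i++)
			printf("event %d: %llu\n", i,
			       (unsigned long long)buf[1 + i]);
	return 0;
}

With this setup, each overflow of the leader produces a sample whose
read data carries the instruction and branch counts as they stood at
that overflow, i.e. the "snapshot B, C at the moment of the overflow"
behavior; the final read() above only reports the end-of-run totals.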