On Mon, Nov 15, 2021 at 10:05:05AM +0530, Bhaskara Budiredla wrote: > This driver adds support for Last-level cache tag-and-data unit > (LLC-TAD) PMU that is featured in some of the Marvell's CN10K > infrastructure silicons. > > The LLC is divided into 2N slices distributed across N Mesh tiles > in a single-socket configuration. The driver always configures the > same counter for all of the TADs. The user would end up effectively > reserving one of eight counters in every TAD to look across all TADs. > The occurrences of events are aggregated and presented to the user > at the end of an application run. The driver does not provide a way > for the user to partition TADs so that different TADs are used for > different applications. > > The event counters are zeroed to start event counting to avoid any > rollover issues. TAD perf counters are 64-bit, so it's not currently > possible to overflow event counters at current mesh and core > frequencies. > > To measure tad pmu events use perf tool stat command. For instance: > > perf stat -e tad_dat_msh_in_dss,tad_req_msh_out_any <application> > perf stat -e tad_alloc_any,tad_hit_any,tad_tag_rd <application> > > Signed-off-by: Bhaskara Budiredla <bbudiredla@xxxxxxxxxxx> > --- > drivers/perf/Kconfig | 7 + > drivers/perf/Makefile | 1 + > drivers/perf/marvell_cn10k_tad_pmu.c | 429 +++++++++++++++++++++++++++ > 3 files changed, 437 insertions(+) > create mode 100644 drivers/perf/marvell_cn10k_tad_pmu.c > > diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig > index 77522e5efe11..53b8fa554343 100644 > --- a/drivers/perf/Kconfig > +++ b/drivers/perf/Kconfig > @@ -137,6 +137,13 @@ config ARM_DMC620_PMU > Support for PMU events monitoring on the ARM DMC-620 memory > controller. > > +config MARVELL_CN10K_TAD_PMU > + tristate "Marvell CN10K LLC-TAD PMU" > + depends on ARM64 || (COMPILE_TEST && 64BIT) > + help > + Provides support for Last-Level cache Tag-and-data Units (LLC-TAD) > + performance monitors on CN10K family silicons. > + > source "drivers/perf/hisilicon/Kconfig" > > endmenu > diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile > index 5260b116c7da..2db5418d5b0a 100644 > --- a/drivers/perf/Makefile > +++ b/drivers/perf/Makefile > @@ -14,3 +14,4 @@ obj-$(CONFIG_THUNDERX2_PMU) += thunderx2_pmu.o > obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o > obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o > obj-$(CONFIG_ARM_DMC620_PMU) += arm_dmc620_pmu.o > +obj-$(CONFIG_MARVELL_CN10K_TAD_PMU) += marvell_cn10k_tad_pmu.o > diff --git a/drivers/perf/marvell_cn10k_tad_pmu.c b/drivers/perf/marvell_cn10k_tad_pmu.c > new file mode 100644 > index 000000000000..250dd4c52d70 > --- /dev/null > +++ b/drivers/perf/marvell_cn10k_tad_pmu.c > @@ -0,0 +1,429 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* Marvell CN10K LLC-TAD perf driver > + * > + * Copyright (C) 2021 Marvell > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License version 2 as > + * published by the Free Software Foundation. > + */ > + > +#define pr_fmt(fmt) "tad_pmu: " fmt > + > +#include <linux/module.h> > +#include <linux/of.h> > +#include <linux/of_address.h> > +#include <linux/of_device.h> > +#include <linux/cpuhotplug.h> > +#include <linux/perf_event.h> > +#include <linux/platform_device.h> > + > +#define TAD_PFC_OFFSET 0x0 > +#define TAD_PFC(counter) (TAD_PFC_OFFSET | (counter << 3)) > +#define TAD_PRF_OFFSET 0x100 > +#define TAD_PRF(counter) (TAD_PRF_OFFSET | (counter << 3)) > +#define TAD_PRF_CNTSEL_MASK 0xFF > +#define TAD_MAX_COUNTERS 8 > + > +#define to_tad_pmu(p) (container_of(p, struct tad_pmu, pmu)) > + > +struct tad_region { > + void __iomem *base; > +}; > + > +struct tad_pmu { > + struct pmu pmu; > + struct tad_region *regions; > + u32 region_cnt; > + unsigned int cpu; > + struct hlist_node node; > + struct perf_event *events[TAD_MAX_COUNTERS]; > + DECLARE_BITMAP(counters_map, TAD_MAX_COUNTERS); > +}; > + > +static int tad_pmu_cpuhp_state; > + > +static void tad_pmu_event_counter_read(struct perf_event *event) > +{ > + struct tad_pmu *tad_pmu = to_tad_pmu(event->pmu); > + struct hw_perf_event *hwc = &event->hw; > + u32 counter_idx = hwc->idx; > + u64 prev, new; > + int i; > + > + do { > + prev = local64_read(&hwc->prev_count); > + for (i = 0, new = 0; i < tad_pmu->region_cnt; i++) > + new += readq(tad_pmu->regions[i].base + > + TAD_PFC(counter_idx)); > + } while (local64_cmpxchg(&hwc->prev_count, prev, new) != prev); I plan to queue this as-is, but I did remark on the previous version that this loop is needlessly expensive. Why are you not using readq_relaxed() and why are you doing that _inside_ the cmpxchg() loop? Will