>-----Original Message----- >From: Will Deacon <will@xxxxxxxxxx> >Sent: Tuesday, October 26, 2021 3:14 PM >To: Bhaskara Budiredla <bbudiredla@xxxxxxxxxxx> >Cc: mark.rutland@xxxxxxx; robh+dt@xxxxxxxxxx; Bharat Bhushan ><bbhushan2@xxxxxxxxxxx>; Sunil Kovvuri Goutham ><sgoutham@xxxxxxxxxxx>; linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; >devicetree@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx >Subject: [EXT] Re: [PATCH v6 1/2] drivers: perf: Add LLC-TAD perf counter >support > >External Email > >---------------------------------------------------------------------- >On Mon, Oct 18, 2021 at 09:00:56PM +0530, Bhaskara Budiredla wrote: >> This driver adds support for Last-level cache tag-and-data unit >> (LLC-TAD) PMU that is featured in some of the Marvell's CN10K >> infrastructure silicons. >> >> The LLC is divided into 2N slices distributed across N Mesh tiles in a >> single-socket configuration. The driver always configures the same >> counter for all of the TADs. The user would end up effectively >> reserving one of eight counters in every TAD to look across all TADs. >> The occurrences of events are aggregated and presented to the user at >> the end of an application run. The driver does not provide a way for >> the user to partition TADs so that different TADs are used for >> different applications. > >Is that something you will want to do in the future? If you go with your current >approach of exposing a single "tad" unit to userspace, then you won't be able >to change that. > There is no intension to do partitioning of the TADs. I have thrown some light on it as it is a point to be stressed upon. >For the L3 PMUs (including on TX2). we expose per-node PMUs so why >shouldn't we do something similar here and expose each TAD region >separately? Even if userspace drives them all together, it gives you more >flexibility in the future if you _do_ want to be partition them up. > Marvell has no plans of providing per-node pmu statistics on CN10k platforms. >> diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile index >> 5260b116c7da..2db5418d5b0a 100644 >> --- a/drivers/perf/Makefile >> +++ b/drivers/perf/Makefile >> @@ -14,3 +14,4 @@ obj-$(CONFIG_THUNDERX2_PMU) += >thunderx2_pmu.o >> obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o >> obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o >> obj-$(CONFIG_ARM_DMC620_PMU) += arm_dmc620_pmu.o >> +obj-$(CONFIG_MARVELL_CN10K_TAD_PMU) += >marvell_cn10k_tad_pmu.o >> diff --git a/drivers/perf/marvell_cn10k_tad_pmu.c >> b/drivers/perf/marvell_cn10k_tad_pmu.c >> new file mode 100644 >> index 000000000000..aebb1a0028dc >> --- /dev/null >> +++ b/drivers/perf/marvell_cn10k_tad_pmu.c >> @@ -0,0 +1,429 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +/* Marvell CN10K LLC-TAD perf driver >> + * >> + * Copyright (C) 2021 Marvell >> + * >> + * This program is free software; you can redistribute it and/or >> +modify >> + * it under the terms of the GNU General Public License version 2 as >> + * published by the Free Software Foundation. >> + */ >> + >> +#define pr_fmt(fmt) "tad_pmu: " fmt >> + >> +#include <linux/module.h> >> +#include <linux/of.h> >> +#include <linux/of_address.h> >> +#include <linux/of_device.h> >> +#include <linux/cpuhotplug.h> >> +#include <linux/perf_event.h> >> +#include <linux/platform_device.h> >> + >> +#define TAD_PFC_OFFSET 0x0 >> +#define TAD_PFC(counter) (TAD_PFC_OFFSET | (counter << 3)) >> +#define TAD_PRF_OFFSET 0x100 >> +#define TAD_PRF(counter) (TAD_PRF_OFFSET | (counter << 3)) >> +#define TAD_PRF_CNTSEL_MASK 0xFF >> +#define TAD_MAX_COUNTERS 8 >> + >> +#define to_tad_pmu(p) (container_of(p, struct tad_pmu, pmu)) >> + >> +struct tad_region { >> + void __iomem *base; >> +}; >> + >> +struct tad_pmu { >> + struct pmu pmu; >> + struct tad_region *regions; >> + u32 region_cnt; >> + unsigned int cpu; >> + struct hlist_node node; >> + struct perf_event *events[TAD_MAX_COUNTERS]; >> + DECLARE_BITMAP(counters_map, TAD_MAX_COUNTERS); }; >> + >> +static int tad_pmu_cpuhp_state; >> + >> +static void tad_pmu_event_counter_read(struct perf_event *event) { >> + struct tad_pmu *tad_pmu = to_tad_pmu(event->pmu); >> + struct hw_perf_event *hwc = &event->hw; >> + u32 counter_idx = hwc->idx; >> + u64 delta, prev, new; >> + int i; >> + >> + do { >> + prev = local64_read(&hwc->prev_count); >> + for (i = 0, new = 0; i < tad_pmu->region_cnt; i++) >> + new += readq(tad_pmu->regions[i].base + >> + TAD_PFC(counter_idx)); >> + } while (local64_cmpxchg(&hwc->prev_count, prev, new) != prev); > >If we expose each TAD individually, then this won't matter, but I'd be inclined >to move the counter summation outside of the cmpxchg() loop given that >readq (why not _relaxed?) is probably quite slow. > As clarified above partitioning of TADs is ruled out and situation of exposing individual TADs inappropriate. >> + >> + delta = (new - prev) & GENMASK_ULL(63, 0); > >This mask doesn't do anything. > Agreed, and will delete it. >> + local64_add(delta, &event->count); >> +} >> + >> +static void tad_pmu_event_counter_stop(struct perf_event *event, int >> +flags) { >> + struct tad_pmu *tad_pmu = to_tad_pmu(event->pmu); >> + struct hw_perf_event *hwc = &event->hw; >> + u32 counter_idx = hwc->idx; >> + int i; >> + >> + /* TAD()_PFC() stop counting on the write >> + * which sets TAD()_PRF()[CNTSEL] == 0 >> + */ >> + for (i = 0; i < tad_pmu->region_cnt; i++) >> + writeq_relaxed(0, tad_pmu->regions[i].base + >> + TAD_PRF(counter_idx)); > >Please use braces around a multi-line conditional statement. > Taken. >Will Thanks, Bhaskara