RE: [EXT] Re: [PATCH v6 1/2] drivers: perf: Add LLC-TAD perf counter support

Bhaskara Budiredla <bbudiredla@xxxxxxxxxxx> · Tue, 26 Oct 2021 18:04:22 +0000

>-----Original Message-----
>From: Will Deacon <will@xxxxxxxxxx>
>Sent: Tuesday, October 26, 2021 3:14 PM
>To: Bhaskara Budiredla <bbudiredla@xxxxxxxxxxx>
>Cc: mark.rutland@xxxxxxx; robh+dt@xxxxxxxxxx; Bharat Bhushan
><bbhushan2@xxxxxxxxxxx>; Sunil Kovvuri Goutham
><sgoutham@xxxxxxxxxxx>; linux-arm-kernel@xxxxxxxxxxxxxxxxxxx;
>devicetree@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
>Subject: [EXT] Re: [PATCH v6 1/2] drivers: perf: Add LLC-TAD perf counter
>support
>
>External Email
>
>----------------------------------------------------------------------
>On Mon, Oct 18, 2021 at 09:00:56PM +0530, Bhaskara Budiredla wrote:
>> This driver adds support for Last-level cache tag-and-data unit
>> (LLC-TAD) PMU that is featured in some of the Marvell's CN10K
>> infrastructure silicons.
>>
>> The LLC is divided into 2N slices distributed across N Mesh tiles in a
>> single-socket configuration. The driver always configures the same
>> counter for all of the TADs. The user would end up effectively
>> reserving one of eight counters in every TAD to look across all TADs.
>> The occurrences of events are aggregated and presented to the user at
>> the end of an application run. The driver does not provide a way for
>> the user to partition TADs so that different TADs are used for
>> different applications.
>
>Is that something you will want to do in the future? If you go with your current
>approach of exposing a single "tad" unit to userspace, then you won't be able
>to change that.
>

There is no intension to do partitioning of the TADs. I have thrown some light on it as it is a point to be stressed upon.

>For the L3 PMUs (including on TX2). we expose per-node PMUs so why
>shouldn't we do something similar here and expose each TAD region
>separately? Even if userspace drives them all together, it gives you more
>flexibility in the future if you _do_ want to be partition them up.
>

Marvell has no plans of providing per-node pmu statistics on CN10k platforms.

>> diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile index
>> 5260b116c7da..2db5418d5b0a 100644
>> --- a/drivers/perf/Makefile
>> +++ b/drivers/perf/Makefile
>> @@ -14,3 +14,4 @@ obj-$(CONFIG_THUNDERX2_PMU) +=
>thunderx2_pmu.o
>>  obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
>>  obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o
>>  obj-$(CONFIG_ARM_DMC620_PMU) += arm_dmc620_pmu.o
>> +obj-$(CONFIG_MARVELL_CN10K_TAD_PMU) +=
>marvell_cn10k_tad_pmu.o
>> diff --git a/drivers/perf/marvell_cn10k_tad_pmu.c
>> b/drivers/perf/marvell_cn10k_tad_pmu.c
>> new file mode 100644
>> index 000000000000..aebb1a0028dc
>> --- /dev/null
>> +++ b/drivers/perf/marvell_cn10k_tad_pmu.c
>> @@ -0,0 +1,429 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Marvell CN10K LLC-TAD perf driver
>> + *
>> + * Copyright (C) 2021 Marvell
>> + *
>> + * This program is free software; you can redistribute it and/or
>> +modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + */
>> +
>> +#define pr_fmt(fmt) "tad_pmu: " fmt
>> +
>> +#include <linux/module.h>
>> +#include <linux/of.h>
>> +#include <linux/of_address.h>
>> +#include <linux/of_device.h>
>> +#include <linux/cpuhotplug.h>
>> +#include <linux/perf_event.h>
>> +#include <linux/platform_device.h>
>> +
>> +#define TAD_PFC_OFFSET		0x0
>> +#define TAD_PFC(counter)	(TAD_PFC_OFFSET | (counter << 3))
>> +#define TAD_PRF_OFFSET		0x100
>> +#define TAD_PRF(counter)	(TAD_PRF_OFFSET | (counter << 3))
>> +#define TAD_PRF_CNTSEL_MASK	0xFF
>> +#define TAD_MAX_COUNTERS	8
>> +
>> +#define to_tad_pmu(p) (container_of(p, struct tad_pmu, pmu))
>> +
>> +struct tad_region {
>> +	void __iomem	*base;
>> +};
>> +
>> +struct tad_pmu {
>> +	struct pmu pmu;
>> +	struct tad_region *regions;
>> +	u32 region_cnt;
>> +	unsigned int cpu;
>> +	struct hlist_node node;
>> +	struct perf_event *events[TAD_MAX_COUNTERS];
>> +	DECLARE_BITMAP(counters_map, TAD_MAX_COUNTERS); };
>> +
>> +static int tad_pmu_cpuhp_state;
>> +
>> +static void tad_pmu_event_counter_read(struct perf_event *event) {
>> +	struct tad_pmu *tad_pmu = to_tad_pmu(event->pmu);
>> +	struct hw_perf_event *hwc = &event->hw;
>> +	u32 counter_idx = hwc->idx;
>> +	u64 delta, prev, new;
>> +	int i;
>> +
>> +	do {
>> +		prev = local64_read(&hwc->prev_count);
>> +		for (i = 0, new = 0; i < tad_pmu->region_cnt; i++)
>> +			new += readq(tad_pmu->regions[i].base +
>> +				     TAD_PFC(counter_idx));
>> +	} while (local64_cmpxchg(&hwc->prev_count, prev, new) != prev);
>
>If we expose each TAD individually, then this won't matter, but I'd be inclined
>to move the counter summation outside of the cmpxchg() loop given that
>readq (why not _relaxed?) is probably quite slow.
>

As clarified above partitioning of TADs is ruled out and situation of exposing individual TADs inappropriate.

>> +
>> +	delta = (new - prev) & GENMASK_ULL(63, 0);
>
>This mask doesn't do anything.
>

Agreed, and will delete it.

>> +	local64_add(delta, &event->count);
>> +}
>> +
>> +static void tad_pmu_event_counter_stop(struct perf_event *event, int
>> +flags) {
>> +	struct tad_pmu *tad_pmu = to_tad_pmu(event->pmu);
>> +	struct hw_perf_event *hwc = &event->hw;
>> +	u32 counter_idx = hwc->idx;
>> +	int i;
>> +
>> +	/* TAD()_PFC() stop counting on the write
>> +	 * which sets TAD()_PRF()[CNTSEL] == 0
>> +	 */
>> +	for (i = 0; i < tad_pmu->region_cnt; i++)
>> +		writeq_relaxed(0, tad_pmu->regions[i].base +
>> +			       TAD_PRF(counter_idx));
>
>Please use braces around a multi-line conditional statement.
>

Taken.

>Will

Thanks,
Bhaskara