On Mon, 6 Feb 2023 14:51:44 +0800 Jie Zhan <zhanjie9@xxxxxxxxxxxxx> wrote: > HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads > PMU accesses from CPUs, handling the configuration, event switching, and > counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained > and multi-PMU-event CPU profiling, in which scenario the current 'perf' > scheme may lose events or drop sampling frequency. With PMCU, users can > reliably obtain the data of up to 240 PMU events with the sample interval > of events down to 1ms, while the software overhead of accessing PMUs, as > well as its impact on target workloads, is reduced. > > This driver enables the usage of PMCU through the perf_event framework. > PMCU is registered as a PMU device and utilises the AUX buffer to dump data > directly. Users can start PMCU sampling through 'perf-record'. Event > numbers are passed by a sysfs interface. > > Signed-off-by: Jie Zhan <zhanjie9@xxxxxxxxxxxxx> Hi Jie, A few minor comments inline. Whilst I looked at this internally, that was a while back so I've found a few new things to point out in what I think is a pretty good/clean driver. The main thing here is the RFC questions you've raised in the cover letter of course - particularly the one around mediating who has the counters between this and the normal PMU driver. Thanks, Jonathan > --- > drivers/perf/hisilicon/Kconfig | 15 + > drivers/perf/hisilicon/Makefile | 1 + > drivers/perf/hisilicon/hisi_pmcu.c | 1096 ++++++++++++++++++++++++++++ > 3 files changed, 1112 insertions(+) > create mode 100644 drivers/perf/hisilicon/hisi_pmcu.c > > diff --git a/drivers/perf/hisilicon/Kconfig b/drivers/perf/hisilicon/Kconfig > index 171bfc1b6bc2..d7728fbe8519 100644 > --- a/drivers/perf/hisilicon/Kconfig > +++ b/drivers/perf/hisilicon/Kconfig > @@ -24,3 +24,18 @@ config HNS3_PMU > devices. > Adds the HNS3 PMU into perf events system for monitoring latency, > bandwidth etc. > + > +config HISI_PMCU > + tristate "HiSilicon PMCU" > + depends on ARM64 && PID_IN_CONTEXTIDR > + help > + Support for HiSilicon Performance Monitor Control Unit (PMCU). > + HiSilicon Performance Monitor Control Unit (PMCU) is a device that > + offloads PMU accesses from CPUs, handling the configuration, event > + switching, and counter reading of core PMUs on Kunpeng SoC. It > + facilitates fine-grained and multi-PMU-event CPU profiling, in which > + scenario the current 'perf' scheme may lose events or drop sampling > + frequency. With PMCU, users can reliably obtain the data of up to 240 > + PMU events with the sample interval of events down to 1ms, while the > + software overhead of accessing PMUs, as well as its impact on target > + workloads, is reduced. > diff --git a/drivers/perf/hisilicon/Makefile b/drivers/perf/hisilicon/Makefile > index 4d2c9abe3372..93e4e6f2816a 100644 > --- a/drivers/perf/hisilicon/Makefile > +++ b/drivers/perf/hisilicon/Makefile > @@ -5,3 +5,4 @@ obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o hisi_uncore_l3c_pmu.o \ > > obj-$(CONFIG_HISI_PCIE_PMU) += hisi_pcie_pmu.o > obj-$(CONFIG_HNS3_PMU) += hns3_pmu.o > +obj-$(CONFIG_HISI_PMCU) += hisi_pmcu.o > diff --git a/drivers/perf/hisilicon/hisi_pmcu.c b/drivers/perf/hisilicon/hisi_pmcu.c > new file mode 100644 > index 000000000000..6ec5d6c31e1f > --- /dev/null > +++ b/drivers/perf/hisilicon/hisi_pmcu.c > @@ -0,0 +1,1096 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +/* > + * HiSilicon Performance Monitor Control Unit (PMCU) driver > + * > + * Copyright (C) 2022 HiSilicon Limited > + * Author: Jie Zhan <zhanjie9@xxxxxxxxxxxxx> > + */ > + > +#include <linux/acpi.h> Not seeing this used. Probably want mod_devicetable.h that includes the struct acpi_device_id definition. > +#include <linux/bitfield.h> > +#include <linux/bits.h> > +#include <linux/cpumask.h> > +#include <linux/delay.h> > +#include <linux/dev_printk.h> > +#include <linux/device.h> > +#include <linux/dma-mapping.h> > +#include <linux/errno.h> > +#include <linux/gfp_types.h> It is very very rare for this to be included directly. Normally just rely on indirect inclusion from slab.h or similar. I would drop this one. > +#include <linux/interrupt.h> > +#include <linux/kernel.h> > +#include <linux/mm_types.h> > +#include <linux/module.h> > +#include <linux/perf_event.h> > +#include <linux/platform_device.h> > +#include <linux/printk.h> property.h > +#include <linux/slab.h> > +#include <linux/smp.h> > +#include <linux/threads.h> > +#include <linux/vmalloc.h> > + > +#include <asm/cputype.h> > +#include <asm/sysreg.h> > + > +/* Registers */ > +#define HISI_PMCU_REG_FSM_STATUS 0x0000 > +#define HISI_PMCU_REG_FSM_CFG 0x0004 > +#define HISI_PMCU_REG_EVENT_BASE_H 0x0008 > +#define HISI_PMCU_REG_EVENT_BASE_L 0x000C > +#define HISI_PMCU_REG_KILL_BASE_H 0x0010 > +#define HISI_PMCU_REG_KILL_BASE_L 0x0014 > +#define HISI_PMCU_REG_STORE_BASE_H 0x0018 > +#define HISI_PMCU_REG_STORE_BASE_L 0x001C > +#define HISI_PMCU_REG_WAIT_CNT 0x0020 > +#define HISI_PMCU_REG_FSM_CTRL 0x0038 > +#define HISI_PMCU_REG_FSM_BRK 0x003C > +#define HISI_PMCU_REG_COMP 0x0044 > +#define HISI_PMCU_REG_INT_EN 0x0100 > +#define HISI_PMCU_REG_INT_MSK 0x0104 > +#define HISI_PMCU_REG_INT_STAT 0x0108 > +#define HISI_PMCU_REG_INT_CLR 0x010C > +#define HISI_PMCU_REG_PMCR 0x0200 > +#define HISI_PMCU_REG_PMCCFILTR 0x0204 > + > +/* Register related configs */ > +#define HISI_PMCU_FSM_CFG_EV_LEN_MSK GENMASK(7, 0) > +#define HISI_PMCU_FSM_CFG_NR_LOOP_MSK GENMASK(15, 8) > +#define HISI_PMCU_FSM_CFG_NR_PMU_MSK GENMASK(19, 16) > +#define HISI_PMCU_FSM_CFG_MAX_EV_LEN 240 As this is used in various places that are only loosely assocated with this register, I'd just rename it HISI_PMCU_MAX_EVN_LEN. Similar probably applies to some of these others. > +#define HISI_PMCU_FSM_CFG_MAX_NR_LOOP 255 > +#define HISI_PMCU_FSM_CFG_MAX_NR_PMU 8 > +#define HISI_PMCU_FSM_CFG_MAX_NR_PMU_C 5 > +#define HISI_PMCU_WAIT_CNT_DEFAULT 0x249F0 > +#define HISI_PMCU_FSM_CTRL_TRIGGER BIT(0) > +#define HISI_PMCU_FSM_BRK_BRK BIT(0) > +#define HISI_PMCU_COMP_HPMN_THR 3 > +#define HISI_PMCU_COMP_ENABLE BIT(0) > +#define HISI_PMCU_INT_DONE BIT(0) > +#define HISI_PMCU_INT_BRK BIT(1) > +#define HISI_PMCU_INT_ALL GENMASK(1, 0) > +#define HISI_PMCU_PMCR_DEFAULT 0xC1 How is this related to the architecture defined PMCR register? Or just a coincidence of naming? Either way, I'm assuming 0xC1 is probably multiple fields so if possible can we break this down further with defines to show where the value comes from. > +#define HISI_PMCU_PMCCFILTR_MSK GENMASK(31, 24) ... > +/** > + * struct hisi_pmcu_events - PMCU events and sampling configuration > + * @nr_pmu: number of core PMU counters that run in parallel > + * @padding: number of padding events in a sample > + * @nr_ev: number of events passed by users in a sample > + * @nr_ev_per_sample: number of events passed to hardware for a sample > + * This equals nr_ev + padding and should be evenly > + * divisible by nr_pmu. > + * @max_sample_loop: max number of samples that can be done in a loop > + * @ev_len: event length for hardware to read in a loop > + * @nr_loop: number of loops in one trigger > + * @comp_mode: compatibility mode > + * @nr_sample: number of samples that the current trigger takes > + * @nr_pending_sample: number of pending samples > + * @subsample_size: size of a subsample > + * @sample_size: size of a sample > + * @output_size: size of output from one trigger > + * @sample_period: sample period passed to hardware > + * @nr_cpu: number of hardware threads (logical CPUs) > + * @events: event IDs passed from users Maybe say what they are for rather than where they come from? event IDs to sample. > + */ > +struct hisi_pmcu_events { > + u8 nr_pmu; > + u8 padding; > + u8 nr_ev; > + u8 nr_ev_per_sample; > + u8 max_sample_loop; > + u8 ev_len; > + u8 nr_loop; > + u8 comp_mode; Could you use the enum hisi_pmcu_comp_mode type for this? > + u32 nr_sample; > + u32 nr_pending_sample; > + u32 subsample_size; > + u32 sample_size; > + u32 output_size; > + u32 sample_period; > + u32 nr_cpu; > + u32 events[HISI_PMCU_FSM_CFG_MAX_EV_LEN]; > +}; > + ... > +static const struct attribute_group hisi_pmcu_format_attr_group = { > + .name = "format", > + .attrs = hisi_pmcu_format_attrs, > +}; > + > +static ssize_t monitored_cpus_show(struct device *dev, > + struct device_attribute *attr, char *buf) > +{ > + struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(dev_get_drvdata(dev)); > + > + return sysfs_emit(buf, "%d-%d\n", > + cpumask_first(&hisi_pmcu->cpus), > + cpumask_last(&hisi_pmcu->cpus)); What does this do about offline CPUs? Should it include them or not? > +} > + > +static DEVICE_ATTR_ADMIN_RO(monitored_cpus); > +static ssize_t user_events_store(struct device *dev, > + struct device_attribute *attr, > + const char *buf, size_t count) > +{ > + struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(dev_get_drvdata(dev)); > + struct hisi_pmcu_user_events *user_ev = &hisi_pmcu->user_ev; > + u32 head, tail, nr_ev; > + char *line; > + int err; > + > + line = kcalloc(count + 1, sizeof(*line), GFP_KERNEL); Doesn't seem to be freed anywhere. > + nr_ev = 0; > + head = 0; > + tail = 0; > + while (nr_ev < HISI_PMCU_FSM_CFG_MAX_EV_LEN) { > + while (head < count && isspace(buf[head])) > + head++; > + if (!isxdigit(buf[head])) > + break; > + tail = head + 1; > + > + while (tail < count && isalnum(buf[tail])) > + tail++; > + > + strncpy(line, buf + head, tail - head); > + line[tail - head] = '\0'; > + err = kstrtou16(line, 16, &user_ev->ev[nr_ev]); > + if (err) { > + user_ev->nr_ev = 0; > + return err; > + } > + nr_ev++; > + head = tail; > + } > + user_ev->nr_ev = nr_ev; > + > + return count; > +} > + > +static int hisi_pmcu_pmu_event_init(struct perf_event *event) > +{ > + struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(event->pmu); > + struct hisi_pmcu_events *ev = &hisi_pmcu->ev; > + void __iomem *base = hisi_pmcu->regbase; > + u64 cfg; > + u32 val; > + > + if (event->attr.type != hisi_pmcu->pmu.type) > + return -ENOENT; > + > + if (hisi_pmcu->busy) > + return -EBUSY; > + > + cfg = event->attr.config; > + > + val = FIELD_GET(HISI_PMCU_PERF_ATTR_NR_SAMPLE, cfg); val gets used for a lot of different things in this function. I would use as set of new local variables with names that make it more obvious what they are. > + ev->nr_pending_sample = val ? val : HISI_PMCU_PERF_NR_SAMPLE_DEFAULT; local variable isn't that useful here and makes the whole reuse of value issue worse. ev->nr_pending_sample = FIELD_GET(...); if (ev->nr_pending_sample == 0) ev->nr_pending_sample = HISI... > + > + val = FIELD_GET(HISI_PMCU_PERF_ATTR_SAMPLE_PERIOD_MS, cfg); > + if (val > HISI_PMCU_PERF_MAX_SAMPLE_PERIOD_MS) { > + dev_err(hisi_pmcu->dev, "sample period too long (max=0x%x)\n", > + HISI_PMCU_PERF_MAX_SAMPLE_PERIOD_MS); > + return -EINVAL; > + } > + ev->sample_period = val ? val * HISI_PMCU_PERF_MS_TO_WAIT_CNT : > + HISI_PMCU_WAIT_CNT_DEFAULT; > + > + cfg = event->attr.config1; > + > + val = FIELD_GET(HISI_PMCU_PERF_ATTR_PMCCFILTR, cfg); > + val = FIELD_PREP(HISI_PMCU_PMCCFILTR_MSK, val); > + writel(val, base + HISI_PMCU_REG_PMCCFILTR); > + > + return 0; > +} > + ... > +static void hisi_pmcu_hw_sample_start(struct hisi_pmcu *hisi_pmcu, > + struct hisi_pmcu_buf *buf) > +{ > + struct hisi_pmcu_sbuf *sbuf = &buf->sbuf[buf->cur_buf]; > + struct hisi_pmcu_events *ev = &hisi_pmcu->ev; > + void __iomem *base = hisi_pmcu->regbase; > + u64 addr, end; > + u32 val; > + > + /* FSM CFG */ > + val = FIELD_PREP(HISI_PMCU_FSM_CFG_EV_LEN_MSK, ev->ev_len); > + val |= FIELD_PREP(HISI_PMCU_FSM_CFG_NR_LOOP_MSK, ev->nr_loop); > + val |= FIELD_PREP(HISI_PMCU_FSM_CFG_NR_PMU_MSK, ev->nr_pmu); > + writel(val, base + HISI_PMCU_REG_FSM_CFG); > + > + /* Sample period */ > + writel(ev->sample_period, base + HISI_PMCU_REG_WAIT_CNT); > + > + /* Event ID base */ > + addr = virt_to_phys(ev->events); > + val = upper_32_bits(addr); > + writel(val, base + HISI_PMCU_REG_EVENT_BASE_H); No point in using the local variable val here that I can see. writel(upper_32_bits(addr), base + ...) > + val = lower_32_bits(addr); > + writel(val, base + HISI_PMCU_REG_EVENT_BASE_L); same with this one. > + > + /* sbuf end */ > + end = page_to_phys(sbuf->page) + sbuf->size; > + > + /* Data output address */ > + addr = end - sbuf->remain; > + val = upper_32_bits(addr); > + writel(val, base + HISI_PMCU_REG_STORE_BASE_H); and this one. > + val = lower_32_bits(addr); > + writel(val, base + HISI_PMCU_REG_STORE_BASE_L); and another. etc.. > + > + /* Stop data output if sbuf end is reached (abnormally) */ > + addr = end; > + val = upper_32_bits(addr); > + writel(val, base + HISI_PMCU_REG_KILL_BASE_H); > + val = lower_32_bits(addr); > + writel(val, base + HISI_PMCU_REG_KILL_BASE_L); > + > + /* Trigger */ > + writel(HISI_PMCU_FSM_CTRL_TRIGGER, base + HISI_PMCU_REG_FSM_CTRL); > +} > + > + > +static void hisi_pmcu_write_auxtrace_header(struct hisi_pmcu_events *ev, > + struct hisi_pmcu_buf *buf) > +{ > + struct hisi_pmcu_auxtrace_header header; > + struct hisi_pmcu_sbuf *sbuf; > + u32 *data; > + u32 sz; > + > + sbuf = &buf->sbuf[buf->cur_buf]; > + > + header.buffer_size = sbuf->size; > + header.nr_pmu = ev->nr_pmu; > + header.nr_cpu = ev->nr_cpu; > + header.comp_mode = ev->comp_mode; > + header.subsample_size = ev->subsample_size; > + header.nr_subsample_per_sample = ev->nr_ev_per_sample / ev->nr_pmu; > + header.nr_event = ev->nr_ev_per_sample; Might be nicer to read as as struct hisi_pmcu_sbuf *sbuf = &buf->sbuf[buf->cur_buf]; struct hisi_pmcu_auxtrace_header header = { .buffer_size = sbuf->size, .nr_pmu = ev->nr_pmu, .nr_cpu = ev->nr_cpu, ... }; > + > + data = page_to_virt(sbuf->page); > + memcpy(data, &header, sizeof(header)); > + memcpy(data + sizeof(header) / sizeof(*data), ev->events, > + ev->nr_ev_per_sample * sizeof(u32)); I'm not sure why data is a u32 * A few things that would make this neater. * write the header directly. struct hisi_pmcu_auxtrace_header *header = page_to_virt(sbuf->page); *header = (struct hisi_pmcu_auxtrace_header) { .buffer_size = sbuf->size, .nr_pmu = ev->nr_pmu, .nr_cpu = ev->nr_cpu, ... }; * Use header + 1 to get to the address just after it. memcpy(header + 1, ev->events, ev->nr_ev_per_sample * sizeof(u32)); * Instead add teh data to the structure (maybe rename it) struct hisi_pmcu_auxtrace_header { u32 buffer_size; u32 nr_pmu; u32 nr_cpu; u32 comp_mode; u32 subsample_size; u32 nr_subsample_per_sample; u32 nr_event; u32 data[]; }; > + > + sz = sizeof(header) + ev->nr_ev_per_sample * sizeof(u32); with above augmented header structure, struct_size() > + sz = round_up(sz, HISI_PMCU_AUX_HEADER_ALIGN); > + > + sbuf->remain -= sz; > +} > + > +static void hisi_pmcu_pmu_start(struct perf_event *event, int flags) > +{ > + struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(event->pmu); > + struct perf_output_handle *handle = &hisi_pmcu->handle; > + struct hw_perf_event *hwc = &event->hw; > + struct hisi_pmcu_buf *buf; > + int err; > + > + spin_lock(&hisi_pmcu->lock); > + > + if (hisi_pmcu->busy) { > + dev_info(hisi_pmcu->dev, > + "Sampling is running, pmu->start() ignored\n"); I'm not sure on perf convention on this, but I'd have though dev_dbg enough for this. If this is normal thing to do then feel free to leave it. > + goto out; > + } > + ... > + > +static void hisi_pmcu_pmu_stop(struct perf_event *event, int flags) > +{ > + struct hisi_pmcu *hisi_pmcu = to_hisi_pmcu(event->pmu); > + struct hw_perf_event *hwc = &event->hw; > + struct perf_output_handle *handle; > + struct hisi_pmcu_sbuf *sbuf; > + struct hisi_pmcu_buf *buf; > + int err; > + > + spin_lock(&hisi_pmcu->lock); > + > + handle = &hisi_pmcu->handle; > + > + /* If PMCU is running, break it */ > + if (hisi_pmcu->busy) { > + dev_info(hisi_pmcu->dev, "Stopping PMCU sampling\n"); Is this useful? dev_dbg maybe? > + err = hisi_pmcu_hw_sample_stop(hisi_pmcu); > + if (err) > + dev_err(hisi_pmcu->dev, > + "Timed out for stopping PMCU!\n"); > + } > + > + buf = perf_get_aux(handle); > + sbuf = &buf->sbuf[buf->cur_buf]; > + perf_aux_output_end(handle, sbuf->size - sbuf->remain); > + > + spin_unlock(&hisi_pmcu->lock); > + > + hwc->state |= PERF_HES_STOPPED; > + perf_event_update_userpage(event); > +} ... > +static int hisi_pmcu_init_data(struct platform_device *pdev, > + struct hisi_pmcu *hisi_pmcu) > +{ > + int ret; > + > + hisi_pmcu->regbase = devm_platform_ioremap_resource(pdev, 0); > + if (IS_ERR(hisi_pmcu->regbase)) > + return dev_err_probe(&pdev->dev, -ENODEV, > + "Failed to map device register space\n"); > + > + ret = device_property_read_u32(&pdev->dev, "hisilicon,scl-id", > + &hisi_pmcu->scclid); These need linux/property.h to be included (mentioned above) > + if (ret < 0) > + return dev_err_probe(&pdev->dev, ret, > + "Failed to read sccl-id!\n"); > + > + /* > + * Obtain the number of CPUs that contributes to the sample size. > + * NR_CPU_CLUSTER is now hard coded as the hardware accesses a certain > + * number of CPUs in a cluster regardless of how many CPUs are actually > + * implemented/available. > + */ > + ret = device_property_read_u32(&pdev->dev, "hisilicon,nr-cluster", > + &hisi_pmcu->ev.nr_cpu); > + if (ret < 0) > + return dev_err_probe(&pdev->dev, ret, > + "Failed to read nr-cluster!\n"); > + hisi_pmcu->ev.nr_cpu *= NR_CPU_CLUSTER; > + > + return 0; > +} > +static struct platform_driver hisi_pmcu_driver = { > + .driver = { > + .name = HISI_PMCU_DRV_NAME, > + .acpi_match_table = hisi_pmcu_acpi_match, > + /* > + * Unbinding driver is not yet supported as we have not worked > + * out a safe bind/unbind process. I'd add a note on this to the cover letter and the above patch description. It's definitely something we should make work for this driver. > + */ > + .suppress_bind_attrs = true, > + }, > + .probe = hisi_pmcu_probe, > +}; > + .. I'm very interested to see what this hardware/driver gets used for. Thanks, Jonathan