thanks Mark for the comments. On Thu, Jul 12, 2018 at 10:26 PM, Mark Rutland <mark.rutland@xxxxxxx> wrote: > Hi Ganapat, > > On Thu, Jun 21, 2018 at 12:03:38PM +0530, Ganapatrao Kulkarni wrote: >> This patch adds a perf driver for the PMU UNCORE devices DDR4 Memory >> Controller(DMC) and Level 3 Cache(L3C). >> >> ThunderX2 has 8 independent DMC PMUs to capture performance events >> corresponding to 8 channels of DDR4 Memory Controller and 16 independent >> L3C PMUs to capture events corresponding to 16 tiles of L3 cache. >> Each PMU supports up to 4 counters. All counters lack overflow interrupt >> and are sampled periodically. >> >> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@xxxxxxxxxx> >> --- >> drivers/perf/Kconfig | 8 + >> drivers/perf/Makefile | 1 + >> drivers/perf/thunderx2_pmu.c | 949 +++++++++++++++++++++++++++++++++++++++++++ >> include/linux/cpuhotplug.h | 1 + >> 4 files changed, 959 insertions(+) >> create mode 100644 drivers/perf/thunderx2_pmu.c >> >> diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig >> index 08ebaf7..ecedb9e 100644 >> --- a/drivers/perf/Kconfig >> +++ b/drivers/perf/Kconfig >> @@ -87,6 +87,14 @@ config QCOM_L3_PMU >> Adds the L3 cache PMU into the perf events subsystem for >> monitoring L3 cache events. >> >> +config THUNDERX2_PMU >> + bool "Cavium ThunderX2 SoC PMU UNCORE" >> + depends on ARCH_THUNDER2 && ARM64 && ACPI > > Surely this depends on NUMA for the node id? it is, i will update. > > [...] > >> +/* >> + * pmu on each socket has 2 uncore devices(dmc and l3), >> + * each uncore device has up to 16 channels, each channel can sample >> + * events independently with counters up to 4. >> + */ >> +struct thunderx2_pmu_uncore_channel { >> + struct pmu pmu; >> + struct hlist_node node; >> + struct thunderx2_pmu_uncore_dev *uncore_dev; >> + int channel; >> + int cpu; >> + DECLARE_BITMAP(active_counters, UNCORE_MAX_COUNTERS); >> + struct perf_event *events[UNCORE_MAX_COUNTERS]; >> + struct hrtimer hrtimer; >> + /* to sync counter alloc/release */ >> + raw_spinlock_t lock; >> +}; > > This lock should not be necessary. The pmu::{add,del} callbacks are > strictly serialised w.r.t. one another per CPU because the core perf > code holds ctx->lock whenever it calls either. > > Previously, you mentioned you'd seen scheduling while atomic when this > lock was removed. Could you please provide a backtrace? yes, i have seen, if i try to kick in simulatneously more events than h/w counters available(4 here). i will try with v6 and send the trace asap. > > [...] > > It would be worth a comment here explaining which resources the channel > PMUs share, and why we need this thunderx2_pmu_uncore_dev. IIUC, it's > just that they share a register windows switched by FW? > >> +struct thunderx2_pmu_uncore_dev { >> + char *name; >> + struct device *dev; >> + enum thunderx2_uncore_type type; >> + void __iomem *base; >> + int node; >> + u32 max_counters; >> + u32 max_channels; >> + u32 max_events; >> + u64 hrtimer_interval; >> + /* this lock synchronizes across channels */ >> + raw_spinlock_t lock; >> + const struct attribute_group **attr_groups; >> + void (*init_cntr_base)(struct perf_event *event, >> + struct thunderx2_pmu_uncore_dev *uncore_dev); >> + void (*select_channel)(struct perf_event *event); >> + void (*stop_event)(struct perf_event *event); >> + void (*start_event)(struct perf_event *event, int flags); >> +}; > > As a general thing, the really long structure/enum/macro names make > things really hard to read. > > Could we s/THUNDERX2/TX2/ and s/thunderx2/tx2/ for identifiers please? thanks, will do. > > Similarly: > > * s/thunderx2_pmu_uncore_channel/tx2_pmu/ > > * s/thunderx2_pmu_uncore_dev/tx2_pmu_group/ > > ... and consistently name those "pmu" and "pmu_group" respectively -- > that mekas the relationship between the two much clearer for someone who > is not intimately familiar with the hardware. These PMUs(UNCORE) counters makes only sense to someone who is familiar with the hardware. IMHO, the current names aligned to hardware design like channels , tiles, counter etc, which is explained in PATCH1/2. I will try to simplify long names. > > That makes things far more legible, e.g. > >> +static inline struct thunderx2_pmu_uncore_channel * >> +pmu_to_thunderx2_pmu_uncore(struct pmu *pmu) >> +{ >> + return container_of(pmu, struct thunderx2_pmu_uncore_channel, pmu); >> +} > > ... becomes: > > static struct tx2_pmu *pmu_to_tx2_pmu(struct pmu *pmu) > { > return container_of(pmu, struct tx2_pmu, pmu); > } > > [...] > >> +/* >> + * sysfs cpumask attributes >> + */ >> +static ssize_t cpumask_show(struct device *dev, >> + struct device_attribute *attr, char *buf) >> +{ >> + struct cpumask cpu_mask; >> + struct thunderx2_pmu_uncore_channel *pmu_uncore = >> + pmu_to_thunderx2_pmu_uncore(dev_get_drvdata(dev)); >> + >> + cpumask_clear(&cpu_mask); >> + cpumask_set_cpu(pmu_uncore->cpu, &cpu_mask); >> + return cpumap_print_to_pagebuf(true, buf, &cpu_mask); >> +} >> +static DEVICE_ATTR_RO(cpumask); > > This can be simplified to: > > static ssize_t cpumask_show(struct device *dev, struct device_attribute > *attr, char *buf) > { > struct thunderx2_pmu_uncore_channel *pmu_uncore; > pmu_uncore = pmu_to_thunderx2_pmu_uncore(dev_get_drvdata(dev)); > > return cpumap_print_to_pagebuf(true, buf, cpumask_of(pmu_uncore->cpu)); > } thanks, will do > > [...] > >> +static int alloc_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore) >> +{ >> + int counter; >> + >> + raw_spin_lock(&pmu_uncore->lock); >> + counter = find_first_zero_bit(pmu_uncore->active_counters, >> + pmu_uncore->uncore_dev->max_counters); >> + if (counter == pmu_uncore->uncore_dev->max_counters) { >> + raw_spin_unlock(&pmu_uncore->lock); >> + return -ENOSPC; >> + } >> + set_bit(counter, pmu_uncore->active_counters); >> + raw_spin_unlock(&pmu_uncore->lock); >> + return counter; >> +} >> + >> +static void free_counter( >> + struct thunderx2_pmu_uncore_channel *pmu_uncore, int counter) >> +{ >> + raw_spin_lock(&pmu_uncore->lock); >> + clear_bit(counter, pmu_uncore->active_counters); >> + raw_spin_unlock(&pmu_uncore->lock); >> +} > > As above, I still don't believe that these locks are necessary, unless > we have a bug elsewhere. i will try to debug and provide you more info. > > [...] > >> +/* >> + * DMC and L3 counter interface is muxed across all channels. >> + * hence we need to select the channel before accessing counter >> + * data/control registers. >> + * >> + * L3 Tile and DMC channel selection is through SMC call >> + * SMC call arguments, >> + * x0 = THUNDERX2_SMC_CALL_ID (Vendor SMC call Id) >> + * x1 = THUNDERX2_SMC_SET_CHANNEL (Id to set DMC/L3C channel) >> + * x2 = Node id >> + * x3 = DMC(1)/L3C(0) >> + * x4 = channel Id >> + */ > > Please document which ID space the node id is in, e.g. > > x2 = FW Node ID (matching SRAT/SLIT IDs) > > ... so that it's clear that this isn't a Linux logical node id, even if > those happen to be the same today. sure. > > [...] > >> +static void thunderx2_uncore_start(struct perf_event *event, int flags) >> +{ >> + struct hw_perf_event *hwc = &event->hw; >> + struct thunderx2_pmu_uncore_channel *pmu_uncore; >> + struct thunderx2_pmu_uncore_dev *uncore_dev; >> + unsigned long irqflags; >> + >> + hwc->state = 0; >> + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu); >> + uncore_dev = pmu_uncore->uncore_dev; >> + >> + raw_spin_lock_irqsave(&uncore_dev->lock, irqflags); >> + uncore_dev->select_channel(event); >> + uncore_dev->start_event(event, flags); >> + raw_spin_unlock_irqrestore(&uncore_dev->lock, irqflags); >> + perf_event_update_userpage(event); >> + >> + if (!find_last_bit(pmu_uncore->active_counters, >> + pmu_uncore->uncore_dev->max_counters)) { > > This would be clearer using !bitmap_empty(). thanks. > >> + hrtimer_start(&pmu_uncore->hrtimer, >> + ns_to_ktime(uncore_dev->hrtimer_interval), >> + HRTIMER_MODE_REL_PINNED); >> + } >> +} > > [...] > >> +static enum hrtimer_restart thunderx2_uncore_hrtimer_callback( >> + struct hrtimer *hrt) > > s/hrt/timer/ please sure, thanks. > >> +{ >> + struct thunderx2_pmu_uncore_channel *pmu_uncore; >> + unsigned long irqflags; >> + int idx; >> + bool restart_timer = false; >> + >> + pmu_uncore = container_of(hrt, struct thunderx2_pmu_uncore_channel, >> + hrtimer); >> + >> + raw_spin_lock_irqsave(&pmu_uncore->uncore_dev->lock, irqflags); >> + for_each_set_bit(idx, pmu_uncore->active_counters, >> + pmu_uncore->uncore_dev->max_counters) { >> + struct perf_event *event = pmu_uncore->events[idx]; >> + >> + thunderx2_uncore_update(event); >> + restart_timer = true; >> + } >> + raw_spin_unlock_irqrestore(&pmu_uncore->uncore_dev->lock, irqflags); >> + >> + if (restart_timer) >> + hrtimer_forward_now(hrt, >> + ns_to_ktime( >> + pmu_uncore->uncore_dev->hrtimer_interval)); >> + >> + return restart_timer ? HRTIMER_RESTART : HRTIMER_NORESTART; >> +} > > You don't need to take the lock at all if there are no active events, > and we can avoid all the conditional logic that way. e.g. assuming the > renames I requested above: > > static enum hrtimer_restart tx2_hrtimer_callback(struct hrtimer *timer) > > { > struct tx2_pmu *pmu; > struct tx2_pmu_group *pmu_group; > unsigned long irqflags; > int max_counters; > int idx; > > pmu = container_of(timer, struct tx2_pmu, hrtimer); > pmu_group = pmu->pmu_group; > max_counters = pmu_group->max_counters; > > if (cpumask_empty(pmu->active_counters, max_counters)) > return HRTIMER_NORESTART; > > raw_spin_lock_irqsave(&pmu_group->lock, irqflags); > for_each_set_bit(idx, pmu->active_counters, max_counters) { > struct perf_event *event = pmu->events[idx]; > > tx2_uncore_update(event); > restart_timer = true; > } > raw_spin_unlock_irqrestore(&pmu_group->lock, irqflags); > > hrtimer_forward_now(hrt, ns_to_ktime(pmu_group->hrtimer_interval)); > > return HRTIMER_RESTART; > } thanks, i will adopt this function. > > [...] > >> + switch (uncore_dev->type) { >> + case PMU_TYPE_L3C: >> + uncore_dev->max_counters = UNCORE_MAX_COUNTERS; >> + uncore_dev->max_channels = UNCORE_L3_MAX_TILES; >> + uncore_dev->max_events = L3_EVENT_MAX; >> + uncore_dev->hrtimer_interval = UNCORE_HRTIMER_INTERVAL; >> + uncore_dev->attr_groups = l3c_pmu_attr_groups; >> + uncore_dev->name = devm_kasprintf(dev, GFP_KERNEL, >> + "uncore_l3c_%d", uncore_dev->node); >> + uncore_dev->init_cntr_base = init_cntr_base_l3c; >> + uncore_dev->start_event = uncore_start_event_l3c; >> + uncore_dev->stop_event = uncore_stop_event_l3c; >> + uncore_dev->select_channel = uncore_select_channel; >> + break; >> + case PMU_TYPE_DMC: >> + uncore_dev->max_counters = UNCORE_MAX_COUNTERS; >> + uncore_dev->max_channels = UNCORE_DMC_MAX_CHANNELS; >> + uncore_dev->max_events = DMC_EVENT_MAX; >> + uncore_dev->hrtimer_interval = UNCORE_HRTIMER_INTERVAL; >> + uncore_dev->attr_groups = dmc_pmu_attr_groups; >> + uncore_dev->name = devm_kasprintf(dev, GFP_KERNEL, >> + "uncore_dmc_%d", uncore_dev->node); >> + uncore_dev->init_cntr_base = init_cntr_base_dmc; >> + uncore_dev->start_event = uncore_start_event_dmc; >> + uncore_dev->stop_event = uncore_stop_event_dmc; >> + uncore_dev->select_channel = uncore_select_channel; >> + break; > > We should probably s/uncore/tx2/, or s/uncore/thunderx2/ to namespace > this. ok, i would preffer tx2. > >> +static int thunderx2_uncore_pmu_offline_cpu(unsigned int cpu, >> + struct hlist_node *node) >> +{ >> + int new_cpu; >> + struct thunderx2_pmu_uncore_channel *pmu_uncore; >> + >> + pmu_uncore = hlist_entry_safe(node, >> + struct thunderx2_pmu_uncore_channel, node); >> + if (cpu != pmu_uncore->cpu) >> + return 0; >> + >> + new_cpu = cpumask_any_and( >> + cpumask_of_node(pmu_uncore->uncore_dev->node), >> + cpu_online_mask); >> + if (new_cpu >= nr_cpu_ids) >> + return 0; >> + >> + pmu_uncore->cpu = new_cpu; >> + perf_pmu_migrate_context(&pmu_uncore->pmu, cpu, new_cpu); >> + return 0; >> +} > > We'll also need a onlining callback. Consider if all CPUs in a NUMA node > were offlined, then we tried to online an arbitrary CPU from that node. sure, will add. > > Thanks, > Mark. thanks Ganapat -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html