On 2023/10/17 17:39, Jonathan Cameron wrote: > On Tue, 17 Oct 2023 09:32:34 +0800 > Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx> wrote: > >> This commit adds the PCIe Performance Monitoring Unit (PMU) driver support >> for T-Head Yitian SoC chip. Yitian is based on the Synopsys PCI Express >> Core controller IP which provides statistics feature. The PMU is a PCIe >> configuration space register block provided by each PCIe Root Port in a >> Vendor-Specific Extended Capability named RAS D.E.S (Debug, Error >> injection, and Statistics). >> >> To facilitate collection of statistics the controller provides the >> following two features for each Root Port: >> >> - one 64-bit counter for Time Based Analysis (RX/TX data throughput and >> time spent in each low-power LTSSM state) and >> - one 32-bit counter for Event Counting (error and non-error events for >> a specified lane) >> >> Note: There is no interrupt for counter overflow. >> >> This driver adds PMU devices for each PCIe Root Port. And the PMU device is >> named based the BDF of Root Port. For example, >> >> 30:03.0 PCI bridge: Device 1ded:8000 (rev 01) >> >> the PMU device name for this Root Port is dwc_rootport_3018. >> >> Example usage of counting PCIe RX TLP data payload (Units of 16 bytes):: > > Question follow through from previous patch comment, why not just > multiply it by 16 when you read it? Is something in perf going to > overflow? As we discussed in Path 1/4, the unit 16 is not general for all groups of Time Based Analysis, I prefer to leave unit part to end perf users. > >> >> $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/ >> >> average RX bandwidth can be calculated like this: >> >> PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window >> >> Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx> > > Most of the comments inline aren't perf driver specific. To me that > part of it looks fine but I'm only an intermittent reviewer of perf > drivers, so that needs more eyes! > > Anyhow, to my eyes this is coming together well but there are a few things > that don't look quite right yet. Thank you for valuable comments, I will try my best to address them :) > > ... > >> +#define DWC_PCIE_LANE_EVENT_MAX_PERIOD GENMASK_ULL(31, 0) >> +#define DWC_PCIE_TIME_BASED_EVENT_MAX_PERIOD GENMASK_ULL(63, 0) >> + >> +struct dwc_pcie_pmu { >> + struct pmu pmu; >> + struct pci_dev *pdev; /* Root Port device */ >> + u16 ras_des; /* RAS DES capability offset */ > > Could call it ras_des_offset as then the comment wouldn't be needed... Agreed, will fix it. > >> + u32 nr_lanes; >> + >> + struct list_head pmu_node; >> + struct hlist_node cpuhp_node; >> + struct perf_event *event[DWC_PCIE_EVENT_TYPE_MAX]; >> + int on_cpu; >> + bool registered; >> +}; >> > > >> +static void dwc_pcie_pmu_unregister_pmu(void *data) >> +{ >> + struct dwc_pcie_pmu *pcie_pmu = data; >> + >> + if (!pcie_pmu->registered) >> + return; >> + >> + perf_pmu_unregister(&pcie_pmu->pmu); >> + pcie_pmu->registered = false; >> + list_del(&pcie_pmu->pmu_node); > For simplicity or reviewing I'd expect either: > a) Exact reverse order of what happened in probe. > b) A comment on why not. > > That probably jus means that the perf_pmu_unregister > should be first. I guess you mean perf_pmu_unregister should be last? Bellow is the exact reverse order of what happened in probe. pcie_pmu->registered = false; list_del(&pcie_pmu->pmu_node); perf_pmu_unregister(&pcie_pmu->pmu); > >> +} > > > ... > >> +static int dwc_pcie_pmu_probe(struct platform_device *plat_dev) >> +{ >> + struct pci_dev *pdev = NULL; >> + struct dwc_pcie_pmu *pcie_pmu; >> + bool notify = false; >> + char *name; >> + u32 bdf; >> + int ret; >> + >> + /* Match the rootport with VSEC_RAS_DES_ID, and register a PMU for it */ >> + for_each_pci_dev(pdev) { >> + u16 vsec; >> + u32 val; >> + >> + if (!(pci_is_pcie(pdev) && >> + pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT)) >> + continue; >> + >> + vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_ALIBABA, >> + DWC_PCIE_VSEC_RAS_DES_ID); >> + if (!vsec) >> + continue; >> + >> + pci_read_config_dword(pdev, vsec + PCI_VNDR_HEADER, &val); >> + if (PCI_VNDR_HEADER_REV(val) != 0x04 || >> + PCI_VNDR_HEADER_LEN(val) != 0x100) > > I'm curious - why check the header length? Paranoia / defensive coding or does it vary? > I'd expect it to be fixed for a given revision. Ideally it should be backwards compatible > so that revs above 4 will always work (possibly with missing features) with rev 4 targeting > software but I guess we can't guaranteed that... Kind of defensive coding. Agreed, I will remove the hender length check to make the driver more compatible. > >> + continue; >> + pci_dbg(pdev, >> + "Detected PCIe Vendor-Specific Extended Capability RAS DES\n"); >> + >> + bdf = PCI_DEVID(pdev->bus->number, pdev->devfn); >> + name = devm_kasprintf(&plat_dev->dev, GFP_KERNEL, "dwc_rootport_%x", >> + bdf); >> + if (!name) { >> + ret = -ENOMEM; >> + goto out; >> + } >> + >> + /* All checks passed, go go go */ >> + pcie_pmu = devm_kzalloc(&plat_dev->dev, sizeof(*pcie_pmu), GFP_KERNEL); >> + if (!pcie_pmu) { >> + ret = -ENOMEM; >> + goto out; >> + } >> + >> + pcie_pmu->pdev = pdev; >> + pcie_pmu->ras_des = vsec; >> + pcie_pmu->nr_lanes = pcie_get_width_cap(pdev); >> + pcie_pmu->on_cpu = -1; >> + pcie_pmu->pmu = (struct pmu){ >> + .module = THIS_MODULE, >> + .attr_groups = dwc_pcie_attr_groups, >> + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, >> + .task_ctx_nr = perf_invalid_context, >> + .event_init = dwc_pcie_pmu_event_init, >> + .add = dwc_pcie_pmu_event_add, >> + .del = dwc_pcie_pmu_event_del, >> + .start = dwc_pcie_pmu_event_start, >> + .stop = dwc_pcie_pmu_event_stop, >> + .read = dwc_pcie_pmu_event_update, >> + }; >> + >> + /* Add this instance to the list used by the offline callback */ >> + ret = cpuhp_state_add_instance(dwc_pcie_pmu_hp_state, >> + &pcie_pmu->cpuhp_node); >> + if (ret) { >> + pci_err(pcie_pmu->pdev, >> + "Error %d registering hotplug @%x\n", ret, bdf); >> + goto out; >> + } >> + >> + /* Unwind when platform driver removes */ >> + ret = devm_add_action_or_reset( >> + &plat_dev->dev, dwc_pcie_pmu_remove_cpuhp_instance, >> + &pcie_pmu->cpuhp_node); >> + if (ret) >> + goto out; >> + >> + ret = perf_pmu_register(&pcie_pmu->pmu, name, -1); >> + if (ret) { >> + pci_err(pcie_pmu->pdev, >> + "Error %d registering PMU @%x\n", ret, bdf); >> + goto out; >> + } >> + ret = devm_add_action_or_reset( >> + &plat_dev->dev, dwc_pcie_pmu_unregister_pmu, pcie_pmu); >> + if (ret) >> + goto out; > This is messy because your devm callback also deals with the bit below. > So if the _or_reset here happens because this call fails, the list_del will > happen on something that was never added. Simple fix is move this down to after > pcie_pmu->registered given none of the next few line of code can fail anyway. Goot point. I missed the fact that if devm_add_action_or_reset() fails, the action dwc_pcie_pmu_unregister_pmu() will be called right away. Will fix it in next version. > >> + >> + /* Cache PMU to handle pci device hotplug */ >> + list_add(&pcie_pmu->pmu_node, &dwc_pcie_pmu_head); >> + pcie_pmu->registered = true; >> + notify = true; >> + } >> + >> + if (notify && bus_register_notifier(&pci_bus_type, &dwc_pcie_pmu_nb)) >> + notify = false; > As mentioned below, I'd expect the bus_unregister_notifier to be in a remove() > callback, or you could use a devm_add_action_or_reset() an a simple callback. > You can register that as > > if (notify) { > if (bus_register_notifier() == 0) > ret = devm_add_action_or_reset(unreg_notifier, > &dwc_pcie_pmu_nb); > } > and then you don't need to track if it was registered or not as the > cleanup only happens if it was. Good suggestion. Will also use devm_add_action_or_reset() to unwind. > > >> + >> + if (notify) >> + dwc_pcie_pmu_notify = true; >> + >> + return 0; >> + >> +out: >> + pci_dev_put(pdev); >> + >> + return ret; >> +} >> + > > > ... > >> + >> +static int dwc_pcie_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_node) >> +{ >> + struct dwc_pcie_pmu *pcie_pmu; >> + struct pci_dev *pdev; >> + int node; >> + cpumask_t mask; >> + unsigned int target; >> + >> + pcie_pmu = hlist_entry_safe(cpuhp_node, struct dwc_pcie_pmu, cpuhp_node); >> + /* Nothing to do if this CPU doesn't own the PMU */ >> + if (cpu != pcie_pmu->on_cpu) >> + return 0; >> + >> + pcie_pmu->on_cpu = -1; >> + pdev = pcie_pmu->pdev; >> + node = dev_to_node(&pdev->dev); >> + if (cpumask_and(&mask, cpumask_of_node(node), cpu_online_mask) && >> + cpumask_andnot(&mask, &mask, cpumask_of(cpu))) >> + target = cpumask_any(&mask); >> + else >> + target = cpumask_any_but(cpu_online_mask, cpu); >> + >> + if (target >= nr_cpu_ids) { >> + pci_err(pcie_pmu->pdev, "There is no CPU to set\n"); > > You have a local variable for pdev, use it here as well. Will use local pdev directly. > >> + return 0; >> + } >> + >> + /* This PMU does NOT support interrupt, just migrate context. */ >> + perf_pmu_migrate_context(&pcie_pmu->pmu, cpu, target); >> + pcie_pmu->on_cpu = target; >> + >> + return 0; >> +} >> + >> +static struct platform_driver dwc_pcie_pmu_driver = { >> + .probe = dwc_pcie_pmu_probe, >> + .driver = {.name = "dwc_pcie_pmu",}, >> +}; >> + >> +static int __init dwc_pcie_pmu_init(void) >> +{ >> + int ret; >> + >> + ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, >> + "perf/dwc_pcie_pmu:online", >> + dwc_pcie_pmu_online_cpu, >> + dwc_pcie_pmu_offline_cpu); >> + if (ret < 0) >> + return ret; >> + >> + dwc_pcie_pmu_hp_state = ret; >> + >> + ret = platform_driver_register(&dwc_pcie_pmu_driver); >> + if (ret) { >> + cpuhp_remove_multi_state(dwc_pcie_pmu_hp_state); >> + return ret; >> + } >> + >> + dwc_pcie_pmu_dev = platform_device_register_simple( >> + "dwc_pcie_pmu", PLATFORM_DEVID_NONE, NULL, 0); >> + if (IS_ERR(dwc_pcie_pmu_dev)) { >> + platform_driver_unregister(&dwc_pcie_pmu_driver); > > Why no cpuhp_remove_multi_state() in this error path? > > I'd move to the approach of a gotos and a single error handling block as > that makes this sort of thing easier to spot. Agreed, will use the approach of a gotos to handle errors. > >> + return PTR_ERR(dwc_pcie_pmu_dev); >> + } >> + >> + return 0; >> +} >> + >> +static void __exit dwc_pcie_pmu_exit(void) >> +{ >> + platform_device_unregister(dwc_pcie_pmu_dev); >> + platform_driver_unregister(&dwc_pcie_pmu_driver); >> + cpuhp_remove_multi_state(dwc_pcie_pmu_hp_state); >> + >> + if (dwc_pcie_pmu_notify) > > If you have something unusual like this a driver module_exit() it definitely > deserves a comment on why. I'm surprised by this as I'd expect the notifier > to be unregistered in the driver remove so not sure why this is here. > I've lost track of earlier discussions so if this was addressed then all > we need is a comment here for the next person to run into it! All replied above, I will unregistered the notifier by devm_add_action_or_reset(). I am curious about that what the difference between unregistered in module_exit() and remove()? > >> + bus_unregister_notifier(&pci_bus_type, &dwc_pcie_pmu_nb); >> +}