On 2023/7/27 17:39, Jonathan Cameron wrote:
> On Tue, 6 Jun 2023 23:14:07 +0800
> Yicong Yang <yangyicong@xxxxxxxxxx> wrote:
>
>> On 2023/6/6 15:49, Shuai Xue wrote:
>>> This commit adds the PCIe Performance Monitoring Unit (PMU) driver support
>>> for T-Head Yitian SoC chip. Yitian is based on the Synopsys PCI Express
>>> Core controller IP which provides statistics feature. The PMU is not a PCIe
>>> Root Complex integrated End Point(RCiEP) device but only register counters
>>> provided by each PCIe Root Port.
>>>
>>> To facilitate collection of statistics the controller provides the
>>> following two features for each Root Port:
>>>
>>> - Time Based Analysis (RX/TX data throughput and time spent in each
>>>   low-power LTSSM state)
>>> - Event counters (Error and Non-Error for lanes)
>>>
>>> Note, only one counter for each type and does not overflow interrupt.
>>>
>>> This driver adds PMU devices for each PCIe Root Port. And the PMU device is
>>> named based the BDF of Root Port. For example,
>>>
>>>     30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
>>>
>>> the PMU device name for this Root Port is dwc_rootport_3018.
>>>
>>> Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
>>>
>>>     $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
>>>
>>> average RX bandwidth can be calculated like this:
>>>
>>>     PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
>>>
>>> Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>
>>> Reported-by: kernel test robot <lkp@xxxxxxxxx>
>>> Link: https://lore.kernel.org/oe-kbuild-all/202305170639.XU3djFZX-lkp@xxxxxxxxx/
>>> Reviewed-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
>
> I'll review on top to avoid any duplication with Yicong.

Thank you! It also served as a reminder that I had missed Yicong's email.
It appears that Thunderbird mistakenly moved his email to the junk folder,
which is how I overlooked it.

> Note I've cropped the stuff neither of us commented on so it's
> easier to spot the feedback.

Thank you for noting that. My replies are inline.

>
> Jonathan
>
>>> ---
>>>  drivers/perf/Kconfig        |   7 +
>>>  drivers/perf/Makefile       |   1 +
>>>  drivers/perf/dwc_pcie_pmu.c | 706 ++++++++++++++++++++++++++++++++++++
>>>  3 files changed, 714 insertions(+)
>>>  create mode 100644 drivers/perf/dwc_pcie_pmu.c
>>>
>>> diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
>>> index 711f82400086..6ff3921d7a62 100644
>>> --- a/drivers/perf/Kconfig
>>> +++ b/drivers/perf/Kconfig
>>> @@ -209,6 +209,13 @@ config MARVELL_CN10K_DDR_PMU
>>>  	  Enable perf support for Marvell DDR Performance monitoring
>>>  	  event on CN10K platform.
>>>
>>> +config DWC_PCIE_PMU
>>> +	tristate "Enable Synopsys DesignWare PCIe PMU Support"
>>> +	depends on (ARM64 && PCI)
>>> +	help
>>> +	  Enable perf support for Synopsys DesignWare PCIe PMU Performance
>>> +	  monitoring event on Yitian 710 platform.
>
> The documentation kind of implies this isn't platform specific.
> If some parts are (such as which events exist) then you may want to push
> that to userspace / perftool with appropriate matching against specific SoC.
>
> If it is generic, then change this text to "event on platform including the
> Yitian 710."

It is generic, with nothing platform specific, so I will change the text as
you suggested.

>
>>> +
>>>  source "drivers/perf/arm_cspmu/Kconfig"
>>>
>>>  source "drivers/perf/amlogic/Kconfig"
>
>>> new file mode 100644
>>> index 000000000000..8bfcf6e0662d
>>> --- /dev/null
>>> +++ b/drivers/perf/dwc_pcie_pmu.c
>>> @@ -0,0 +1,706 @@
>
> ...
>
>>> +
>>> +struct dwc_pcie_pmu {
>>> +	struct pci_dev *pdev;		/* Root Port device */
>>
>> If the root port removed after the probe of this PCIe PMU driver, we'll access the NULL
>> pointer. I didn't see you hold the root port to avoid the removal.

Do you mean that I should hold a reference count on the Root Port with
pci_dev_get() when allocating pcie_pmu?

	pcie_pmu->pdev = pci_dev_get(pdev);
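If so, I suppose the teardown path then needs a matching pci_dev_put() to
keep the reference count balanced. A minimal sketch of what I have in mind
(untested; dwc_pcie_pmu_put_rootport() is just a placeholder name for a
devm action):

	static void dwc_pcie_pmu_put_rootport(void *data)
	{
		struct dwc_pcie_pmu *pcie_pmu = data;

		/* Drop the reference taken at probe time. */
		pci_dev_put(pcie_pmu->pdev);
	}

	/* In __dwc_pcie_pmu_probe(), after allocating pcie_pmu: */
	pcie_pmu->pdev = pci_dev_get(pdev);
	ret = devm_add_action_or_reset(priv->dev, dwc_pcie_pmu_put_rootport,
				       pcie_pmu);
	if (ret)
		return ret;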
>>
>>> +	u16 ras_des;			/* RAS DES capability offset */
>>> +	u32 nr_lanes;
>>> +
>>> +	struct list_head pmu_node;
>>> +	struct hlist_node cpuhp_node;
>>> +	struct pmu pmu;
>>> +	struct perf_event *event;
>>> +	int oncpu;
>>> +};
>>> +
>>> +struct dwc_pcie_pmu_priv {
>>> +	struct device *dev;
>>> +	struct list_head pmu_nodes;
>>> +};
>>> +
>>> +#define to_dwc_pcie_pmu(p) (container_of(p, struct dwc_pcie_pmu, pmu))
>>> +
>>
>> somebody told me to put @pmu as the first member then this macro will have no calculation. :)

Aha, you are right, I will move it to be the first member.

> ...
>
>>> +static ssize_t dwc_pcie_event_show(struct device *dev,
>>> +				   struct device_attribute *attr, char *buf)
>>> +{
>>> +	struct dwc_pcie_event_attr *eattr;
>>> +
>>> +	eattr = container_of(attr, typeof(*eattr), attr);
>>> +
>>> +	if (eattr->type == DWC_PCIE_LANE_EVENT)
>>> +		return sysfs_emit(buf, "eventid=0x%x,type=0x%x,lane=?\n",
>>> +				  eattr->eventid, eattr->type);
>>> +
>
> Elsewhere you always check for DWC_PCIE_TIME_BASE_EVENT.
> Should probably do so here as well for consistency.

Yes, I will also add the check here.

>
>>> +	return sysfs_emit(buf, "eventid=0x%x,type=0x%x\n", eattr->eventid,
>>> +			  eattr->type);
>>> +}
>
>>> +static struct attribute *dwc_pcie_pmu_time_event_attrs[] = {
>>> +	/* Group #0 */
>>> +	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(one_cycle, 0x00),
>>> +	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(TX_L0S, 0x01),
>>> +	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(RX_L0S, 0x02),
>>> +	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(L0, 0x03),
>>> +	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(L1, 0x04),
>>> +	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(L1_1, 0x05),
>>> +	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(L1_2, 0x06),
>>> +	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(CFG_RCVRY, 0x07),
>>> +	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(TX_RX_L0S, 0x08),
>>> +	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(L1_AUX, 0x09),
>>> +
>>> +	/* Group #1 */
>>> +	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(Tx_PCIe_TLP_Data_Payload, 0x20),
>>> +	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(Rx_PCIe_TLP_Data_Payload, 0x21),
>>> +	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(Tx_CCIX_TLP_Data_Payload, 0x22),
>>> +	DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(Rx_CCIX_TLP_Data_Payload, 0x23),
>>> +
>>> +	/*
>>> +	 * Leave it to the user to specify the lane ID to avoid generating
>>> +	 * a list of hundreds of events.
>>> +	 */
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_ack_dllp, 0x600),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_update_fc_dllp, 0x601),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_ack_dllp, 0x602),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_update_fc_dllp, 0x603),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_nulified_tlp, 0x604),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_nulified_tlp, 0x605),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_duplicate_tl, 0x606),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_memory_write, 0x700),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_memory_read, 0x701),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_configuration_write, 0x702),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_configuration_read, 0x703),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_io_write, 0x704),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_io_read, 0x705),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_completion_without_data, 0x706),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_completion_with_data, 0x707),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_message_tlp, 0x708),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_atomic, 0x709),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_tlp_with_prefix, 0x70A),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_memory_write, 0x70B),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_memory_read, 0x70C),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_io_write, 0x70F),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_io_read, 0x710),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_completion_without_data, 0x711),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_completion_with_data, 0x712),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_message_tlp, 0x713),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_atomic, 0x714),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_tlp_with_prefix, 0x715),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_ccix_tlp, 0x716),
>>> +	DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_ccix_tlp, 0x717),
>>> +
>>
>> Intended blank line?

Nope, will delete it.

>>
>>> +	NULL
>>> +};
>
>
> ...
>
>>> +static u64 dwc_pcie_pmu_read_time_based_counter(struct dwc_pcie_pmu *pcie_pmu)
>>> +{
>>> +	struct pci_dev *pdev = pcie_pmu->pdev;
>>> +	u16 ras_des = pcie_pmu->ras_des;
>>> +	u64 count;
>>> +	u32 val;
>>> +
>>> +	pci_read_config_dword(
>>> +		pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH, &val);
>>> +	count = val;
>>> +	count <<= 32;
>>> +
>>> +	pci_read_config_dword(
>>> +		pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW, &val);
>
> This looks like tearing can occur. you probably need to protect against that
> (usual trick is re read the _HIGH part and if it changed, try again)
>
> The hardware might prevent tearing (it would freeze the low register when you
> read the high one, then only let it change after a read of the low registers is
> done). If that's the case - add a comment to say so.

Good catch, I will check with the hardware designer and reply later.
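In case the hardware does not freeze the registers, I understand the usual
re-read trick you describe would look something like the following
(untested sketch):

	static u64 dwc_pcie_pmu_read_time_based_counter(struct dwc_pcie_pmu *pcie_pmu)
	{
		struct pci_dev *pdev = pcie_pmu->pdev;
		u16 ras_des = pcie_pmu->ras_des;
		u32 lo, hi, hi2;

		pci_read_config_dword(
			pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH, &hi);
		do {
			hi2 = hi;
			pci_read_config_dword(
				pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW, &lo);
			/* Re-read the high half; if a carry changed it, retry. */
			pci_read_config_dword(
				pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH, &hi);
		} while (hi != hi2);

		return ((u64)hi << 32) | lo;
	}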
>
>>> +
>>> +	count += val;
>>> +
>>> +	return count;
>>> +}
>>> +
>
>
> ...
>
>>> +static int dwc_pcie_pmu_event_add(struct perf_event *event, int flags)
>>> +{
>>> +	struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
>>> +	struct pci_dev *pdev = pcie_pmu->pdev;
>>> +	struct hw_perf_event *hwc = &event->hw;
>>> +	enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
>>> +	int event_id = DWC_PCIE_EVENT_ID(event);
>>> +	int lane = DWC_PCIE_EVENT_LANE(event);
>>> +	u16 ras_des = pcie_pmu->ras_des;
>>> +	u32 ctrl;
>>> +
>>> +	/* Only one counter and it is in use */
>
> Yikes. That's quite a restriction. Probably good to mention in the docs.
> I'm a little confused about the architecture though - there seem to be separate
> registers for the Lane and time based events. Can't count those at same time?

I am not quite sure, I will double check it and reply later.

>>> +	if (pcie_pmu->event)
>>> +		return -ENOSPC;
>>> +
>>> +	pcie_pmu->event = event;
>>> +	hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
>>> +
>>> +	if (type == DWC_PCIE_LANE_EVENT) {
>>> +		/* EVENT_COUNTER_DATA_REG needs clear manually */
>>> +		ctrl = FIELD_PREP(DWC_PCIE_CNT_EVENT_SEL, event_id) |
>>> +		       FIELD_PREP(DWC_PCIE_CNT_LANE_SEL, lane) |
>>> +		       FIELD_PREP(DWC_PCIE_CNT_ENABLE, DWC_PCIE_PER_EVENT_OFF) |
>>> +		       FIELD_PREP(DWC_PCIE_EVENT_CLEAR, DWC_PCIE_EVENT_PER_CLEAR);
>>> +		pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL,
>>> +				       ctrl);
>>> +	} else if (type == DWC_PCIE_TIME_BASE_EVENT) {
>>> +		/*
>>> +		 * TIME_BASED_ANAL_DATA_REG is a 64 bit register, we can safely
>>> +		 * use it with any manually controlled duration. And it is
>>> +		 * cleared when next measurement starts.
>>> +		 */
>>> +		ctrl = FIELD_PREP(DWC_PCIE_TIME_BASED_REPORT_SEL, event_id) |
>>> +		       FIELD_PREP(DWC_PCIE_TIME_BASED_DURATION_SEL,
>>> +				  DWC_PCIE_DURATION_MANUAL_CTL) |
>>> +		       DWC_PCIE_TIME_BASED_CNT_ENABLE;
>>> +		pci_write_config_dword(
>>> +			pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL, ctrl);
>>> +	}
>>> +
>>> +	if (flags & PERF_EF_START)
>>> +		dwc_pcie_pmu_event_start(event, PERF_EF_RELOAD);
>>> +
>>> +	perf_event_update_userpage(event);
>>> +
>>> +	return 0;
>>> +}
>
> ...
>
>>> +static int __dwc_pcie_pmu_probe(struct dwc_pcie_pmu_priv *priv)
>>> +{
>>> +	struct pci_dev *pdev = NULL;
>>> +	struct dwc_pcie_pmu *pcie_pmu;
>>> +	char *name;
>>> +	u32 bdf;
>>> +	int ret;
>>> +
>>> +	INIT_LIST_HEAD(&priv->pmu_nodes);
>>> +
>>> +	/* Match the rootport with VSEC_RAS_DES_ID, and register a PMU for it */
>>> +	for_each_pci_dev(pdev) {
>>> +		u16 vsec;
>>> +		u32 val;
>>> +
>>> +		if (!(pci_is_pcie(pdev) &&
>>> +		      pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT))
>>> +			continue;
>>> +
>>> +		vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_ALIBABA,
>>> +						DWC_PCIE_VSEC_RAS_DES_ID);
>>> +		if (!vsec)
>>> +			continue;
>>> +
>>> +		pci_read_config_dword(pdev, vsec + PCI_VNDR_HEADER, &val);
>>> +		if (PCI_VNDR_HEADER_REV(val) != 0x04 ||
>>> +		    PCI_VNDR_HEADER_LEN(val) != 0x100)
>>> +			continue;
>>> +		pci_dbg(pdev,
>>> +			"Detected PCIe Vendor-Specific Extended Capability RAS DES\n");
>>> +
>>> +		bdf = PCI_DEVID(pdev->bus->number, pdev->devfn);
>>> +		name = devm_kasprintf(priv->dev, GFP_KERNEL, "dwc_rootport_%x",
>>> +				      bdf);
>>> +		if (!name)
>>> +			return -ENOMEM;
>>> +
>>> +		/* All checks passed, go go go */
>>> +		pcie_pmu = devm_kzalloc(&pdev->dev, sizeof(*pcie_pmu), GFP_KERNEL);
>>> +		if (!pcie_pmu) {
>>> +			pci_dev_put(pdev);
>>
>> we need to call pci_dev_put on all the return branch below and above and after the for_each_pci_dev()
>> loop to keep the refcnt balance.
>
> Good spot. I'd use a goto for this given there are lots of places.

Forgive me, it has already been caught by other reviewers, but I missed the
other return branches. I will fix them all with a goto.
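Concretely, I plan something like the following (sketch only), so that
every error branch inside the loop drops the reference that
for_each_pci_dev() holds on the current device:

	for_each_pci_dev(pdev) {
		...
		pcie_pmu = devm_kzalloc(&pdev->dev, sizeof(*pcie_pmu), GFP_KERNEL);
		if (!pcie_pmu) {
			ret = -ENOMEM;
			goto out;
		}
		...
	}

	return 0;

out:
	/* Balance the reference held by for_each_pci_dev() on early exit. */
	pci_dev_put(pdev);
	return ret;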
>
>>
>>> +			return -ENOMEM;
>>> +		}
>>> +
>>> +		pcie_pmu->pdev = pdev;
>>> +		pcie_pmu->ras_des = vsec;
>>> +		pcie_pmu->nr_lanes = pcie_get_width_cap(pdev);
>>> +		pcie_pmu->pmu = (struct pmu){
>>> +			.module = THIS_MODULE,
>>> +			.attr_groups = dwc_pcie_attr_groups,
>>> +			.capabilities = PERF_PMU_CAP_NO_EXCLUDE,
>>> +			.task_ctx_nr = perf_invalid_context,
>>> +			.event_init = dwc_pcie_pmu_event_init,
>>> +			.add = dwc_pcie_pmu_event_add,
>>> +			.del = dwc_pcie_pmu_event_del,
>>> +			.start = dwc_pcie_pmu_event_start,
>>> +			.stop = dwc_pcie_pmu_event_stop,
>>> +			.read = dwc_pcie_pmu_event_update,
>>> +		};
>>> +
>>> +		/* Add this instance to the list used by the offline callback */
>>> +		ret = cpuhp_state_add_instance(dwc_pcie_pmu_hp_state,
>>> +					       &pcie_pmu->cpuhp_node);
>>> +		if (ret) {
>>> +			pci_err(pcie_pmu->pdev,
>>> +				"Error %d registering hotplug @%x\n", ret, bdf);
>>> +			return ret;
>>> +		}
>
> Here you mix non devm_ handling in mid way through a series of devm_ calls.
> Whilst I 'think' what you have here is fine, I prefer to minimize thinking
> whilst reviewing and using devm_add_action_or_reset() with callbacks
> in appropriate places would ensure automatic unwinding in the error
> path deals with everything in the reverse order of setup.
>
> You just need two instances - one to unwind the cpuhp_state_add_instance() and
> one to unwind the perf_pmu_register()

Cool, devm_add_action_or_reset() saves my life. I will use it.

>
>>> +		ret = perf_pmu_register(&pcie_pmu->pmu, name, -1);
>>> +		if (ret) {
>>> +			pci_err(pcie_pmu->pdev,
>>> +				"Error %d registering PMU @%x\n", ret, bdf);
>>> +			cpuhp_state_remove_instance_nocalls(
>>> +				dwc_pcie_pmu_hp_state, &pcie_pmu->cpuhp_node);
>>> +			return ret;
>>> +		}
>>> +
>>> +		/* Add registered PMUs and unregister them when this driver remove */
>>> +		list_add(&pcie_pmu->pmu_node, &priv->pmu_nodes);
>
> This handling would be replaced by the tracking devm is doing for us. So I think
> there will be no need for the list.

You are right, will remove it.

>
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static int dwc_pcie_pmu_remove(struct platform_device *pdev)
>>> +{
>>> +	struct dwc_pcie_pmu_priv *priv = platform_get_drvdata(pdev);
>>> +	struct dwc_pcie_pmu *pcie_pmu;
>>> +
>>> +	list_for_each_entry(pcie_pmu, &priv->pmu_nodes, pmu_node) {
>>> +		cpuhp_state_remove_instance(dwc_pcie_pmu_hp_state,
>>> +					    &pcie_pmu->cpuhp_node);
>>> +		perf_pmu_unregister(&pcie_pmu->pmu);
>>
>> should unregister the PMU first, keep the order reverse to __dwc_pcie_pmu_probe().
>
> These two could have been handled via appropriate devm_add_action_or_reset()
> above and let that infrastructure unwind for us in the error path.
>
> If anyone fixes the whole pmu drivers aren't removable mess, then we will
> also end up with remove handling for free :)

As replied above, will use devm_add_action_or_reset().

>
>>
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static int dwc_pcie_pmu_probe(struct platform_device *pdev)
>>> +{
>>> +	struct dwc_pcie_pmu_priv *priv;
>>> +	int ret;
>>> +
>>> +	priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
>>> +	if (!priv)
>>> +		return -ENOMEM;
>>> +
>>> +	priv->dev = &pdev->dev;
>>> +	platform_set_drvdata(pdev, priv);
>>> +
>>> +	/* If one PMU registration fails, remove all. */
>>> +	ret = __dwc_pcie_pmu_probe(priv);
>>> +	if (ret) {
>>> +		dwc_pcie_pmu_remove(pdev);
>
> There is a bit of mixing of devm and not here which makes things somewhat
> hard to reason about. Perhaps take the whole unwind flow over to devm managed.
> See above.

Got it, will do that.
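For the record, the two devm actions would look roughly like this (untested
sketch; the callback names are placeholders, and the error branches would
funnel through the pci_dev_put() goto shown above):

	static void dwc_pcie_pmu_remove_cpuhp_instance(void *hotplug_node)
	{
		cpuhp_state_remove_instance_nocalls(dwc_pcie_pmu_hp_state,
						    hotplug_node);
	}

	static void dwc_pcie_pmu_unregister_pmu(void *data)
	{
		struct dwc_pcie_pmu *pcie_pmu = data;

		perf_pmu_unregister(&pcie_pmu->pmu);
	}

	/* In __dwc_pcie_pmu_probe(), after each registration step: */
	ret = cpuhp_state_add_instance(dwc_pcie_pmu_hp_state,
				       &pcie_pmu->cpuhp_node);
	if (ret)
		goto out;

	ret = devm_add_action_or_reset(priv->dev,
				       dwc_pcie_pmu_remove_cpuhp_instance,
				       &pcie_pmu->cpuhp_node);
	if (ret)
		goto out;

	ret = perf_pmu_register(&pcie_pmu->pmu, name, -1);
	if (ret)
		goto out;

	ret = devm_add_action_or_reset(priv->dev, dwc_pcie_pmu_unregister_pmu,
				       pcie_pmu);
	if (ret)
		goto out;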
>>> +		return ret;
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static void dwc_pcie_pmu_migrate(struct dwc_pcie_pmu *pcie_pmu, unsigned int cpu)
>>> +{
>>> +	/* This PMU does NOT support interrupt, just migrate context. */
>>> +	perf_pmu_migrate_context(&pcie_pmu->pmu, pcie_pmu->oncpu, cpu);
>>> +	pcie_pmu->oncpu = cpu;
>>> +}
>>> +
>>> +static int dwc_pcie_pmu_online_cpu(unsigned int cpu, struct hlist_node *cpuhp_node)
>>> +{
>>> +	struct dwc_pcie_pmu *pcie_pmu;
>>> +	struct pci_dev *pdev;
>>> +	int node;
>>> +
>>> +	pcie_pmu = hlist_entry_safe(cpuhp_node, struct dwc_pcie_pmu, cpuhp_node);
>>> +	pdev = pcie_pmu->pdev;
>>> +	node = dev_to_node(&pdev->dev);
>>> +
>>> +	if (node != NUMA_NO_NODE && cpu_to_node(pcie_pmu->oncpu) != node &&
>
> Perhaps worth a comment on when you might see node == NUMA_NO_NODE?
> Beyond NUMA being entirely disabled, I'd hope that never happens and for that you
> might be able to use a compile time check.
>
> I wonder if this can be simplified by a flag that says if we are already in the
> right node? Might be easier to follow than having similar dance in online and offline
> to figure that out.

Ok, I will add a comment for NUMA_NO_NODE. If there is no NUMA support, I
think any CPU is fine to bind.

pcie_pmu->on_cpu may be a good choice to use as the flag, right?
pcie_pmu->on_cpu will be set to -1 when pcie_pmu is allocated and then
checked first in dwc_pcie_pmu_online_cpu(). Then the code would be:

static int dwc_pcie_pmu_online_cpu(unsigned int cpu, struct hlist_node *cpuhp_node)
{
	struct dwc_pcie_pmu *pcie_pmu;
	struct pci_dev *pdev;
	int node;

	pcie_pmu = hlist_entry_safe(cpuhp_node, struct dwc_pcie_pmu, cpuhp_node);

	/* If another CPU is already managing this PMU, simply return. */
	if (pcie_pmu->on_cpu != -1)
		return 0;

	pdev = pcie_pmu->pdev;
	node = dev_to_node(&pdev->dev);

	/*
	 * Select the first CPU to come up if there is no NUMA support;
	 * otherwise take the first CPU in the device's node. on_cpu is
	 * still -1 at this point, so there is no old context to migrate.
	 */
	if (node == NUMA_NO_NODE || cpu_to_node(cpu) == node)
		pcie_pmu->on_cpu = cpu;

	return 0;
}

>
>
>>> +	    cpu_to_node(cpu) == node)
>>> +		dwc_pcie_pmu_migrate(pcie_pmu, cpu);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static int dwc_pcie_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_node)
>>> +{
>>> +	struct dwc_pcie_pmu *pcie_pmu;
>>> +	struct pci_dev *pdev;
>>> +	int node;
>>> +	cpumask_t mask;
>>> +	unsigned int target;
>>> +
>>> +	pcie_pmu = hlist_entry_safe(cpuhp_node, struct dwc_pcie_pmu, cpuhp_node);
>>> +	if (cpu != pcie_pmu->oncpu)
>>> +		return 0;
>>> +
>>> +	pdev = pcie_pmu->pdev;
>>> +	node = dev_to_node(&pdev->dev);
>>> +	if (cpumask_and(&mask, cpumask_of_node(node), cpu_online_mask) &&
>>> +	    cpumask_andnot(&mask, &mask, cpumask_of(cpu)))
>>> +		target = cpumask_any(&mask);
>>
>> The cpumask_of_node() only contains the online CPUs so this branch is redundant. For arm64
>> using arch_numa.c the node cpumask is updated in numa_{add, remove}_cpu() and for other
>> arthitecture the behaviour should keep consistenct. Please correct my if I'm wrong.

I am afraid that the behaviour is not consistent across all architectures,
and cpumask_of_node() may contain offline CPUs.

cpu_online_mask has bit 'cpu' set iff the CPU is available to the scheduler,
and it is updated by:

	set_cpu_online(cpu, true);
	set_cpu_online(cpu, false);

cpumask_of_node() is an interface to `node_to_cpumask_map`, which is updated
by numa_{add,remove}_cpu().

On arm64, when a CPU receives an IPI_CPU_STOP interrupt, local_cpu_stop()
will set the current CPU offline, but it will not be removed from
cpumask_of_node().
For the ARM64 and RISC-V architectures, numa_remove_cpu() and
set_cpu_online(cpu, false) are both executed in __cpu_disable() when a CPU
is brought down, but for arm32, only set_cpu_online(cpu, false) is called
in __cpu_disable().

>>
>>> +	else
>>> +		target = cpumask_any_but(cpu_online_mask, cpu);
>
> If following above suggestion, would set flag to say in wrong node here - and wherever
> you end up in a node to start with...

Based on the above, I will skip this suggestion.

>
>
>>> +	if (target < nr_cpu_ids)
>>> +		dwc_pcie_pmu_migrate(pcie_pmu, target);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static struct platform_driver dwc_pcie_pmu_driver = {
>>> +	.probe = dwc_pcie_pmu_probe,
>>> +	.remove = dwc_pcie_pmu_remove,
>>> +	.driver = {.name = "dwc_pcie_pmu",},
>>> +};
>>> +
>>> +static int __init dwc_pcie_pmu_init(void)
>>> +{
>>> +	int ret;
>>> +
>>> +	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
>>> +				      "perf/dwc_pcie_pmu:online",
>>> +				      dwc_pcie_pmu_online_cpu,
>>> +				      dwc_pcie_pmu_offline_cpu);
>>> +	if (ret < 0)
>>> +		return ret;
>>> +
>>> +	dwc_pcie_pmu_hp_state = ret;
>>> +
>>> +	ret = platform_driver_register(&dwc_pcie_pmu_driver);
>>> +	if (ret) {
>>> +		cpuhp_remove_multi_state(dwc_pcie_pmu_hp_state);
>>> +		return ret;
>>> +	}
>>> +
>>> +	dwc_pcie_pmu_dev = platform_device_register_simple(
>>> +			"dwc_pcie_pmu", PLATFORM_DEVID_NONE, NULL, 0);
>>> +	if (IS_ERR(dwc_pcie_pmu_dev)) {
>>> +		platform_driver_unregister(&dwc_pcie_pmu_driver);
>>
>> On failure we also need to remove cpuhp state as well.
>
> I'd suggest using gotos and a single error handling block. That
> makes it both harder to forget things like this and easier to
> compare that block with what happens in exit() - so slightly
> easier to review!

Given that we now have an appropriate way to tear down the PMUs via
devm_add_action_or_reset(), I am going to remove the redundant probe/remove
framework registered via platform_driver_{un}register(). The
for_each_pci_dev() probe loop in __dwc_pcie_pmu_probe() will be moved into
dwc_pcie_pmu_init(). Would that be a better way?

Thank you very much for your valuable comments.

Best Regards,
Shuai