Re: [PATCH 0/9] perf: Avoid explicit cpumask var allocation from stack

Mark Rutland <mark.rutland@xxxxxxx> · Tue, 2 Apr 2024 15:41:51 +0100

On Tue, Apr 02, 2024 at 09:40:08PM +0800, Dawei Li wrote:
> Hi Mark,
> 
> Thanks for the quick review.
> 
> On Tue, Apr 02, 2024 at 12:12:50PM +0100, Mark Rutland wrote:
> > On Tue, Apr 02, 2024 at 06:56:01PM +0800, Dawei Li wrote:
> > > Hi,
> > > 
> > > This series try to eliminate direct cpumask var allocation from stack
> > > for perf subsystem.
> > > 
> > > Direct/explicit allocation of cpumask on stack could be dangerous since
> > > it can lead to stack overflow for systems with big NR_CPUS or
> > > CONFIG_CPUMASK_OFFSTACK=y.
> > > 
> > > For arm64, it's more urgent since commit 3fbd56f0e7c1 ("ARM64: Dynamically
> > > allocate cpumasks and increase supported CPUs to 512").
> > > 
> > > It's sort of a pattern that almost every cpumask var in perf subystem
> > > occurs in teardown callback of cpuhp. In which case, if dynamic
> > > allocation failed(which is unlikely), we choose return 0 rather than
> > > -ENOMEM to caller cuz:
> > > @teardown is not supposed to fail and if it does, system crashes:
> > 
> > .. but we've left the system in an incorrect state, so that makes no sense.
> > 
> > As I commented on the first patch, NAK to dynamically allocating cpumasks in
> > the CPUHP callbacks. Please allocate the necessry cpumasks up-front when we
> > probe the PMU. At that time we can handle an allocation failure by cleaning up
> > and failing to probe the PMU, and then the CPUHP callbacks don't need to
> > allocate memory to offline a CPU...
> 
> Agreed that dynamically allocation in callbacks lead to inconsistency
> to system.
> 
> My (original)alternative plan is simple but ugly, just make cpumask var
> _static_ and add extra static lock to protect it.
> 
> The only difference between solution above and your proposal is static/
> dynamic alloction. CPUHP's teardown cb is supposed to run in targetted
> cpuhp thread for most cases, and it's racy. Even the cpumask var is
> wrapped in dynamically allocated struct xxx_pmu, it's still shareable
> between different threads/contexts and needs proper protection.

I was under the impression that the cpuhp callbacks were *strictly* serialised.
If that's not the case, the existing offlining callbacks are horrendously
broken.

Are you *certain* these can race?

Regardless, adding additional locking here is not ok.

> Simple as this(_untested_):
> 
> diff --git a/drivers/perf/arm-cmn.c b/drivers/perf/arm-cmn.c
> index 7ef9c7e4836b..fa89c3db4d7d 100644
> --- a/drivers/perf/arm-cmn.c
> +++ b/drivers/perf/arm-cmn.c
> @@ -1950,18 +1950,24 @@ static int arm_cmn_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_no
>         struct arm_cmn *cmn;
>         unsigned int target;
>         int node;
> -       cpumask_t mask;
> +       static cpumask_t mask;
> +       static DEFINE_SPINLOCK(cpumask_lock);
> 
>         cmn = hlist_entry_safe(cpuhp_node, struct arm_cmn, cpuhp_node);
>         if (cpu != cmn->cpu)
>                 return 0;
> 
> +       spin_lock(&cpumask_lock);
> +
>         node = dev_to_node(cmn->dev);
>         if (cpumask_and(&mask, cpumask_of_node(node), cpu_online_mask) &&
>             cpumask_andnot(&mask, &mask, cpumask_of(cpu)))
>                 target = cpumask_any(&mask);
>         else
>                 target = cpumask_any_but(cpu_online_mask, cpu);
> +
> +       spin_unlock(&cpumask_lock);
> +
>         if (target < nr_cpu_ids)
>                 arm_cmn_migrate(cmn, target);
>         return 0;

Looking at this case, the only reason we need the mask is because it made the
logic a little easier to write. All we really want is to choose some CPU in the
intersection of two masks ignoring a specific CPU, and there was no helper
function to do that.

We can add a new helper to do that for us, which would avoid redundant work to
manipulate the entire mask, and it would make the existing code simpler.  I had
a series a few years back to add cpumask_any_and_but():

  https://lore.kernel.org/lkml/1486381132-5610-1-git-send-email-mark.rutland@xxxxxxx/

... and that's easy to use here, e.g.

| diff --git a/drivers/perf/arm-cmn.c b/drivers/perf/arm-cmn.c
| index 7ef9c7e4836b7..c6bbd387ccf8b 100644
| --- a/drivers/perf/arm-cmn.c
| +++ b/drivers/perf/arm-cmn.c
| @@ -1950,17 +1950,15 @@ static int arm_cmn_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_no
|         struct arm_cmn *cmn;
|         unsigned int target;
|         int node;
| -       cpumask_t mask;
|  
|         cmn = hlist_entry_safe(cpuhp_node, struct arm_cmn, cpuhp_node);
|         if (cpu != cmn->cpu)
|                 return 0;
|  
|         node = dev_to_node(cmn->dev);
| -       if (cpumask_and(&mask, cpumask_of_node(node), cpu_online_mask) &&
| -           cpumask_andnot(&mask, &mask, cpumask_of(cpu)))
| -               target = cpumask_any(&mask);
| -       else
| +       target = cpumask_any_and_but(cpu_online_mask, cpumask_of_node(node),
| +                                    cpu);
| +       if (target >= nr_cpu_ids)
|                 target = cpumask_any_but(cpu_online_mask, cpu);
|         if (target < nr_cpu_ids)
|                 arm_cmn_migrate(cmn, target);

It doesn't trivially rebase since the cpumask code has changed a fair amount,
but I've managed to do that locally, and I can send that out as a
seven-years-late v2 if it's useful.

>From a quick scan, it looks like that'd handle all cases in this series. Are
there any patterns in this series for which that would not be sufficient?

Mark.

> 
> And yes, static allocation is evil :) 
> 
> 
> Thanks,
> 
>     Dawei
> 
> > 
> > Also, for the titles it'd be better to say something like "avoid placing
> > cpumasks on the stack", because "explicit cpumask var allocation" sounds like
> > the use of alloc_cpumask_var().
> 
> Sound great! I will update it.
> 
> > 
> > Mark.
> > 
> > > 
> > > static int cpuhp_issue_call(int cpu, enum cpuhp_state state, bool bringup,
> > >                             struct hlist_node *node)
> > > {
> > >         struct cpuhp_step *sp = cpuhp_get_step(state);
> > >         int ret;
> > > 
> > >         /*
> > >          * If there's nothing to do, we done.
> > >          * Relies on the union for multi_instance.
> > >          */
> > >         if (cpuhp_step_empty(bringup, sp))
> > >                 return 0;
> > >         /*
> > >          * The non AP bound callbacks can fail on bringup. On teardown
> > >          * e.g. module removal we crash for now.
> > >          */
> > > 	#ifdef CONFIG_SMP
> > >         if (cpuhp_is_ap_state(state))
> > >                 ret = cpuhp_invoke_ap_callback(cpu, state, bringup, node);
> > >         else
> > >                 ret = cpuhp_invoke_callback(cpu, state, bringup, node,
> > > 		NULL);
> > > 	#else
> > >         ret = cpuhp_invoke_callback(cpu, state, bringup, node, NULL);
> > > 	#endif
> > >         BUG_ON(ret && !bringup);
> > >         return ret;
> > > }
> > > 
> > > Dawei Li (9):
> > >   perf/alibaba_uncore_drw: Avoid explicit cpumask var allocation from
> > >     stack
> > >   perf/arm-cmn: Avoid explicit cpumask var allocation from stack
> > >   perf/arm_cspmu: Avoid explicit cpumask var allocation from stack
> > >   perf/arm_dsu: Avoid explicit cpumask var allocation from stack
> > >   perf/dwc_pcie: Avoid explicit cpumask var allocation from stack
> > >   perf/hisi_pcie: Avoid explicit cpumask var allocation from stack
> > >   perf/hisi_uncore: Avoid explicit cpumask var allocation from stack
> > >   perf/qcom_l2: Avoid explicit cpumask var allocation from stack
> > >   perf/thunder_x2: Avoid explicit cpumask var allocation from stack
> > > 
> > >  drivers/perf/alibaba_uncore_drw_pmu.c    | 13 +++++++++----
> > >  drivers/perf/arm-cmn.c                   | 13 +++++++++----
> > >  drivers/perf/arm_cspmu/arm_cspmu.c       | 13 +++++++++----
> > >  drivers/perf/arm_dsu_pmu.c               | 18 +++++++++++++-----
> > >  drivers/perf/dwc_pcie_pmu.c              | 17 +++++++++++------
> > >  drivers/perf/hisilicon/hisi_pcie_pmu.c   | 15 ++++++++++-----
> > >  drivers/perf/hisilicon/hisi_uncore_pmu.c | 13 +++++++++----
> > >  drivers/perf/qcom_l2_pmu.c               | 15 ++++++++++-----
> > >  drivers/perf/thunderx2_pmu.c             | 20 ++++++++++++--------
> > >  9 files changed, 92 insertions(+), 45 deletions(-)
> > > 
> > > 
> > > Thanks,
> > > 
> > >     Dawei
> > > 
> > > -- 
> > > 2.27.0
> > > 
> >