Hello Thomas,

On Thu, Jan 09, 2020 at 09:02:20PM +0100, Thomas Gleixner wrote:
> Ming,
>
> Ming Lei <ming.lei@xxxxxxxxxx> writes:
>
> > On Thu, Dec 19, 2019 at 09:32:14AM -0500, Peter Xu wrote:
> >> ... this one seems to be more appealing at least to me.
> >
> > OK, please try the following patch:
> >
> >
> > diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> > index 6c8512d3be88..0fbcbacd1b29 100644
> > --- a/include/linux/sched/isolation.h
> > +++ b/include/linux/sched/isolation.h
> > @@ -13,6 +13,7 @@ enum hk_flags {
> >  	HK_FLAG_TICK		= (1 << 4),
> >  	HK_FLAG_DOMAIN		= (1 << 5),
> >  	HK_FLAG_WQ		= (1 << 6),
> > +	HK_FLAG_MANAGED_IRQ	= (1 << 7),
> >  };
> >
> >  #ifdef CONFIG_CPU_ISOLATION
> > diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> > index 1753486b440c..0a75a09cc4e8 100644
> > --- a/kernel/irq/manage.c
> > +++ b/kernel/irq/manage.c
> > @@ -20,6 +20,7 @@
> >  #include <linux/sched/task.h>
> >  #include <uapi/linux/sched/types.h>
> >  #include <linux/task_work.h>
> > +#include <linux/sched/isolation.h>
> >
> >  #include "internals.h"
> >
> > @@ -212,12 +213,33 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
> >  {
> >  	struct irq_desc *desc = irq_data_to_desc(data);
> >  	struct irq_chip *chip = irq_data_get_irq_chip(data);
> > +	const struct cpumask *housekeeping_mask =
> > +		housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);
> >  	int ret;
> > +	cpumask_var_t tmp_mask;
> >
> >  	if (!chip || !chip->irq_set_affinity)
> >  		return -EINVAL;
> >
> > -	ret = chip->irq_set_affinity(data, mask, force);
> > +	if (!zalloc_cpumask_var(&tmp_mask, GFP_KERNEL))
> > +		return -EINVAL;
>
> That's wrong. This code is called with interrupts disabled, so
> GFP_KERNEL is wrong. And NO, we won't do a GFP_ATOMIC allocation here.

OK, looks like desc->lock is held here, so a sleeping allocation can't
be done.

>
> > +	/*
> > +	 * Userspace can't change managed irq's affinity, make sure
> > +	 * that isolated CPU won't be selected as the effective CPU
> > +	 * if this irq's affinity includes both isolated CPU and
> > +	 * housekeeping CPU.
> > +	 *
> > +	 * This way guarantees that isolated CPU won't be interrupted
> > +	 * by IO submitted from housekeeping CPU.
> > +	 */
> > +	if (irqd_affinity_is_managed(data) &&
> > +	    cpumask_intersects(mask, housekeeping_mask))
> > +		cpumask_and(tmp_mask, mask, housekeeping_mask);
>
> This is duct tape engineering with absolutely no semantics. I can't even
> figure out the intent of this 'managed_irq' parameter.

The intent is to isolate the specified CPUs from handling managed
interrupts. For non-managed interrupts, the isolation is done from
userspace, because userspace is allowed to change a non-managed
interrupt's affinity.

>
> If the intent is to keep managed device interrupts away from isolated
> cores then you really want to do that when the interrupts are spread and
> not in the middle of the affinity setter code.
>
> But first you need to define how that mask should work:
>
>   1) Exclude CPUs from managed interrupt spreading completely
>
>   2) Exclude CPUs only when the resulting spreading contains
>      housekeeping CPUs
>
>   3) Whatever ...

We can do that. The big problem is that in the RT case we can't
guarantee that IO is never submitted from an isolated CPU. blk-mq's
queue mapping relies on the affinity spread at setup time, so unknown
behavior (kernel crash, IO hang, or other issues) may be caused if we
exclude isolated CPUs from the interrupt affinity.

That is why I am trying to exclude isolated CPUs from the interrupt's
effective affinity instead; it turns out the approach is simple and
doable. A rough, untested sketch of how the GFP_KERNEL allocation
could be avoided with this approach is appended below.

Thanks,
Ming
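
Appended sketch (completely untested, only meant to make the effective
affinity idea concrete; the static scratch mask, the raw spinlock and
the online-CPU fallback are my guesses, not part of the posted patch).
It would replace the direct chip->irq_set_affinity() call in
irq_do_set_affinity():

	if (irqd_affinity_is_managed(data) &&
	    housekeeping_enabled(HK_FLAG_MANAGED_IRQ)) {
		static DEFINE_RAW_SPINLOCK(tmp_mask_lock);
		static struct cpumask tmp_mask;
		const struct cpumask *hk_mask =
			housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);
		const struct cpumask *prog_mask;

		/*
		 * Called with desc->lock held and interrupts disabled,
		 * so no cpumask allocation here; use a static scratch
		 * mask serialized by a raw spinlock instead.
		 */
		raw_spin_lock(&tmp_mask_lock);
		cpumask_and(&tmp_mask, mask, hk_mask);

		/*
		 * If no housekeeping CPU in the requested mask is
		 * online, keep the original mask instead of
		 * programming an empty effective affinity.
		 */
		if (!cpumask_intersects(&tmp_mask, cpu_online_mask))
			prog_mask = mask;
		else
			prog_mask = &tmp_mask;

		ret = chip->irq_set_affinity(data, prog_mask, force);
		raw_spin_unlock(&tmp_mask_lock);
	} else {
		ret = chip->irq_set_affinity(data, mask, force);
	}

The static mask trades a tiny serialization window for not allocating
in atomic context; whether that is acceptable in irq_do_set_affinity()
is of course open for discussion.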