irqbalancer subset policy and CPU lock up on storage controller.

On Tue, Oct 13, 2015 at 06:40:06PM +0530, Kashyap Desai wrote:
> > > > On Mon, Oct 12, 2015 at 11:52:30PM +0530, Kashyap Desai wrote:
> > > > > > > What should be the solution if we really want to slow down IO
> > > > > > > submission to avoid CPU lockup. We don't want only one CPU to
> > > > > > > keep busy for completion.
> > > > > > >
> > > > > > > Any suggestion ?
> > > > > > >
> > > > > > Yup, file a bug with Oracle :)
> > > > >
> > > > > Neil -
> > > > >
> > > > > Thanks for the info. I understood to use the latest <irqbalance>;
> > > > > that was already attempted. I tried with the latest irqbalance and
> > > > > I see the expected behavior as long as I provide <exact> or
> > > > > <subset> + <--policyscript>.
> > > > > We are planning for the same, but wanted to understand what the
> > > > > latest <irqbalance> default settings are. Is there any reason we
> > > > > are seeing the default settings changed from subset to ignore?
> > > > >
> > > >
> > > > Latest defaults are that hinting is ignored by default, but hinting
> > > > can also be set via a policyscript on an irq-by-irq basis.
> > > >
> > > > The reasons for changing the default behavior are documented in
> > > > commit d9138c78c3e8cb286864509fc444ebb4484c3d70.  Irq affinity
> > > > hinting is effectively a holdover from back in the days when
> > > > irqbalance couldn't understand a device's locality and irq count
> > > > easily.  Now that it can, there is really no need for an irq affinity
> > > > hint, unless your driver doesn't properly participate in sysfs device
> > > > enumeration.
> > >
> > > Neil - I went through those details, but could not understand how
> > > the <ignore> policy is useful. I may be missing something here. :-(
> >
> > Yes, what you are missing is the fact that affinity hinting is an
> > outdated method of assigning affinity.  On any modern kernel it's not
> > needed at all, so the default policy is to ignore it.
> 
> Now it is clear. Understood that the affinity hint is no longer required
> from the driver, and <irqbalance> can manage things using the details
> populated in <sysfs>.
> 

Yes, precisely.

> >
> > > With the <ignore> policy, the mpt3sas driver on a 32 logical CPU
> > > system has the below affinity mask. As you said, the driver hint is
> > > ignored.  That is understood, as <ignore> asks for exactly that, but
> > > why is the affinity mask localized to the local node (Node 0 in this
> > > case)?
> >
> > This has nothing to do with the ignore hint policy.  The reasons the
> > below might occur are:
> >
> > 1) The class of the device on the pci bus is such that irqbalance is
> > deciding that numa node is the level at which it should be balanced.
> > Currently there are no such devices that get balanced at that level.
> > There are, however, package level balanced devices, and if you have a
> > single cpu package (with multiple cores) on a single numa node, you
> > might see this behavior. What is the pci class of the mpt3sas adapter?
> 
> <mpt3sas> is a <storage> class adapter. See the <class> sysfs details -
> 
> [root]# cat /sys/devices/pci0000:00/0000:00:03.0/0000:02:00.0/class
> 0x010700
> 
Ok, good, that indicates that irqbalance is correctly identifying the adapter as
a scsi HBA, which means it should map its irqs to a specific core.
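
For reference, here is a quick way to double-check the sysfs attributes that
feed that decision (the device path is the one you quoted; which attributes
irqbalance consults is stated here from memory, so treat the list as an
assumption and the source as authoritative):

  DEV=/sys/devices/pci0000:00/0000:00:03.0/0000:02:00.0
  # 0x010700 decodes as base class 0x01 (mass storage) / subclass 0x07 (SAS)
  cat "$DEV/class"
  # locality inputs: the numa node the device sits on and its local cpus
  cat "$DEV/numa_node" "$DEV/local_cpus"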

> 
> >
> > 2) The interrupt controller on your system doesn't allow for user
> > setting of interrupt affinity.  I don't think that would be the case
> > given that other interrupts can be affined.  If you can manually set
> > the affinity of these irqs you can discount this possibility.
> 
> 
> The affinity hint from the driver is honored with the <exact> policy, and
> manually setting the affinity works on my setup. We can skip this part.
Right, good.

> From the storage controller requirement side, we are looking for the
> msix-vector and logical CPU # mapping to be in the same sequence.
> 
I don't know what you mean by that.  Are you saying that you want an msix-vector
to be mapped to a specific cpu, independent of numa node locality?  What
information do you base that assignment on?

> >
> > 3) You are using a policyscript that assigns these affinities.  As I
> > previously requested, are you using a policy script, and can you post
> > it here?
> 
> I have attached the policy script (a very basic script; we created it just
> to understand irqbalance, and it got our work done).
> What I required was: "use balancing at core level and distribute across
> each NUMA node."
> 

Ok, your script looks ok (though I'm not sure the kernel you are using supports
the existence of the driver directory in /proc/irq/<n>, so it may not be
functioning as expected).
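
As a quick sanity check (a sketch only; irq 355 is just the first mpt3sas
vector from your output, and the handler-directory name in the comment is
illustrative), you can see whether the per-handler directory exists at all:

  # if the kernel creates the per-handler directory, this lists something like
  # /proc/irq/355/mpt3sas0-msix0/ ; if nothing shows up, the script's check
  # has nothing to work with
  ls -d /proc/irq/355/*/ 2>/dev/null || echo "no per-handler directory here"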

> From the attached irqbalance debug output, I can see that irqbalance is
> able to work as expected with the policy script.
> 
Agreed, it does seem to be functioning properly.  Further, I see that those
interrupts are properly assigned to a single core, which leads me to wonder
exactly what is going on here.  The debug output seems to contradict the
affinity mask information you provided earlier.

> >
> > > What is confusing me is that the "cpu affinity mask" is localized to
> > > Numa Node-0, as PCI device enumeration detected that the pci device is
> > > local to numa_node 0.
> >
> > I really don't know what you mean by this.  Yes, your masks seem to be
> > following what could be your numa node layout, but you're assuming (or it
> > sounds like you're assuming) that irqbalance is doing that intentionally.
> > It's not; one of the above things is going on.
> >
> > >
> > >
> > > When you say "Driver does not participate in sysfs enumeration" - does
> > > it mean "numa_node" exposure in sysfs, or anything more than that?
> > > Sorry for the basics, and thanks for helping me understand things.
> > >
> > I mean, does your driver register itself as a pci device?  If so, it
> > should have a directory in sysfs in /sys/bus/pci/<pci b:d:f>/.  As long
> > as that directory exists and is properly populated, irqbalance should
> > have everything it needs to properly assign a cpu to all of your irqs.
> 
> Yes, the driver registers the device as a pci device, and I can see all
> the /sys/bus/pci/devices/ entries for the mpt3sas-attached device.
> 
Agreed, your debug information bears that out.

> > Note that the RHEL6 kernel did not always properly populate that
> > directory.  I added sysfs code to expose the needed irq information in
> > the kernel, and if you have an older kernel and a newer irqbalance, that
> > might be part of the problem - another reason to contact Oracle.
> >
> >
> > Another thing you can try is posting the output of irqbalance while
> > running it with -f and -d.  That will give us some insight as to what
> > it's doing (note I'm referring here to upstream irqbalance, not the old
> > version).  And you still didn't answer my question regarding the
> > policyscript.
> 
> I have attached the irqbalance debug output (latest from github, last
> commit 8922ff13704dd0e069c63d46a7bdad89df5f151c) and the policy script.
> 
> For some reason, I had to move to a different server, which has 4 Numa
> sockets.
> 
> Here is the detail of my setup -
> 
> [root]# lstopo-no-graphics
> Machine (64GB)
>   NUMANode L#0 (P#0 16GB)
>     Socket L#0 + L3 L#0 (10MB)
>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>         PU L#0 (P#0)
>         PU L#1 (P#16)
>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>         PU L#2 (P#1)
>         PU L#3 (P#17)
>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>         PU L#4 (P#2)
>         PU L#5 (P#18)
>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>         PU L#6 (P#3)
>         PU L#7 (P#19)
>     HostBridge L#0
>       PCIBridge
>         PCI 8086:0953
>       PCIBridge
>         PCI 1000:0097
>           Block L#0 "sdb"
>           Block L#1 "sdc"
>           Block L#2 "sdd"
>           Block L#3 "sde"
>           Block L#4 "sdf"
>           Block L#5 "sdg"
>           Block L#6 "sdh"
>       PCIBridge
>         PCI 102b:0532
>       PCI 8086:1d02
>         Block L#7 "sda"
>   NUMANode L#1 (P#1 16GB)
>     Socket L#1 + L3 L#1 (10MB)
>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
>         PU L#8 (P#4)
>         PU L#9 (P#20)
>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
>         PU L#10 (P#5)
>         PU L#11 (P#21)
>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
>         PU L#12 (P#6)
>         PU L#13 (P#22)
>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
>         PU L#14 (P#7)
>         PU L#15 (P#23)
>     HostBridge L#4
>       PCIBridge
>         PCI 1000:005b
>   NUMANode L#2 (P#2 16GB) + Socket L#2 + L3 L#2 (10MB)
>     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
>       PU L#16 (P#8)
>       PU L#17 (P#24)
>     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
>       PU L#18 (P#9)
>       PU L#19 (P#25)
>     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
>       PU L#20 (P#10)
>       PU L#21 (P#26)
>     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
>       PU L#22 (P#11)
>       PU L#23 (P#27)
>   NUMANode L#3 (P#3 16GB)
>     Socket L#3 + L3 L#3 (10MB)
>       L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
>         PU L#24 (P#12)
>         PU L#25 (P#28)
>       L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
>         PU L#26 (P#13)
>         PU L#27 (P#29)
>       L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
>         PU L#28 (P#14)
>         PU L#29 (P#30)
>       L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
>         PU L#30 (P#15)
>         PU L#31 (P#31)
>     HostBridge L#6
>       PCIBridge
>         PCI 8086:1528
>           Net L#8 "enp193s0f0"
>         PCI 8086:1528
>           Net L#9 "enp193s0f1"
>       PCIBridge
>         PCI 8086:0953
>       PCIBridge
> 
> 
> Things are a bit clearer now, but what I am seeing here is: with <ignore>
> and the attached policy script, the CPU to MSIX mask is not in the same
> sequence as the logical CPU #. It is random within the core level. I guess
> it is based on some linked list in irqbalance which I am not able to
> understand.

What exactly do you mean by logic sequence of cpu #?  Are you under the
impression that msix vectors should be mapped to their corresponding cpu number?
That is certainly not the case.  irqbalance assigns irqs to cpus by estimating
how much cpu load each irq is responsible for, and assigning that irq to the cpu
that is the least loaded at the time of assignment (the idea being that cpu load
on each cpu is as close to equal as we can get it after balancing).  If you have
some additional need to place irqs on cpus such that some higher level task
executes on the same cpu as the interrupt, it is the responsibility of the
higher level task to match the irq placement, not vice versa.  You can do some
trickery to make them match up (i.e. do balancing via a policy script that
assigns cores based on the static configuration of the higher level tasks), but
such configurations are never going to be standard, and generally you need to
have those higher level tasks follow the lower level decisions.
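
To make that concrete, here is a minimal policyscript sketch following the
key=value interface described in irqbalance(1).  The driver-name match is an
assumption for illustration; use whatever test reliably picks out the HBA's
vectors on your system:

  #!/bin/sh
  # irqbalance invokes the policyscript as: <script> <sysfs device path> <irq>
  SYSDEV="$1"
  IRQ="$2"

  # Illustrative match on the bound driver name
  DRIVER=$(basename "$(readlink -f "$SYSDEV/driver" 2>/dev/null)")

  if [ "$DRIVER" = "mpt3sas" ]; then
          echo "balance_level=core"   # balance these vectors at core granularity
          echo "hintpolicy=exact"     # follow the driver's affinity_hint verbatim
  fi
  exit 0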

> E.g. the below piece of code hints that, per core, irq numbers are stored
> in the linked list "d->interrupts". This list is not based on a sequential
> traversal, but rather on how interrupts are generated. Right?
> 
As noted above, it's a measurement of approximate cpu load generated by each
interrupt.  As far as assignment goes, there is no relevance to which cpu
handles an interrupt within a numa node from the standpoint of the interrupt
itself.  If you have some requirement to align interrupts with other execution
contexts, then you either need to write your policy script to understand those
higher level configurations, or you need to modify your higher level setup to
follow what irqbalance does.
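
For example (the script path is illustrative), running the daemon in the
foreground with debug output and your script shows which balance_level and
hintpolicy it actually picked up for each irq:

  irqbalance -f -d --policyscript=/root/mpt3sas-policy.sh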


> static void dump_cache_domain(struct topo_obj *d, void *data)
> {
>         char *buffer = data;
>         cpumask_scnprintf(buffer, 4095, d->mask);
>         log(TO_CONSOLE, LOG_INFO,
>             "%s%sCache domain %i:  numa_node is %d cpu mask is %s  (load %lu) \n",
>             log_indent, log_indent,
>             d->number, cache_domain_numa_node(d)->number, buffer,
>             (unsigned long)d->load);
>         if (d->children)
>                 for_each_object(d->children, dump_balance_obj, NULL);
>         if (g_list_length(d->interrupts) > 0)
>                 for_each_irq(d->interrupts, dump_irq, (void *)10);
> }
> 
> I sometimes see different cpu masks, as in the below snippet - the cpu mask
> on my setup varies from run to run, but the good thing is that the mask
> stays within a <core>, just not like <exact>.
> 
They're not going to be the same, and I'm not sure why that is so hard to
understand.  All affinity_hint is is a driver's best guess as to where to put an
irq, and it's, as a rule, sub-optimal.  Expecting irqbalance to arrive at the
same decision as affinity_hint independently is improper reasoning.

>     msix index = 0, irq number = 355, cpu affinity mask = 00000008  hint = 00000001
>     msix index = 1, irq number = 356, cpu affinity mask = 00000004  hint = 00000002
>     msix index = 2, irq number = 357, cpu affinity mask = 00000002  hint = 00000004
>     msix index = 3, irq number = 358, cpu affinity mask = 00000001  hint = 00000008
> 
> 
>     msix index = 0, irq number = 355, cpu affinity mask = 00000002  hint = 00000001
>     msix index = 1, irq number = 356, cpu affinity mask = 00000008  hint = 00000002
>     msix index = 2, irq number = 357, cpu affinity mask = 00000004  hint = 00000004
>     msix index = 3, irq number = 358, cpu affinity mask = 00000001  hint = 00000008
> 
> I am expecting the mapping below, because when the <mpt3sas> driver sends
> an IO it hints to the FW about the completion queue. E.g. if an IO is
> submitted from logical CPU #X, the driver uses smp_processor_id() to get
> that logical CPU #X and expects the completion on the same CPU for better
> performance.  Is this expectation possible with the existing latest
> <irqbalance>?
> 
It is, but only through the use of a policy script that you write to guarantee
that assignment.  With the information that irqbalance has at hand, there is no
need to balance irqs in the below manner.  Even if there is some upper layer
indicator that suggests it might be beneficial, there's nothing that irqbalance
can use to consistently determine that.  You can certainly do it with a policy
script if you like, in any number of ways, but any logic you put in that script
is by and large going to remain there (i.e. it's not going to become part of the
default behavior) unless you can provide:

a) a mechanism to consistently query devices to determine if they should be
balanced in this manner

b) evidence that, for all devices of that class, this provides a performance
benefit.

In other words, you have to show me which irqs this is a benefit for, and how to
tell if an arbitrary irq fits that profile.
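
If you do want to force the strict 1:1 msix-index-to-cpu layout you show below,
one rough sketch (not something I'd suggest as a default) is to have the
policyscript return ban=true for the mpt3sas vectors so irqbalance leaves them
alone, and then pin them yourself once.  The "mpt3sas0-msix" naming and the
idea that /proc/interrupts lists the vectors in msix-index order are both
assumptions; adjust for your setup:

  #!/bin/sh
  # Pin each mpt3sas MSI-X vector to the cpu matching its msix index.
  # Single-group hex masks only, so this sketch covers <= 32 cpus.
  cpu=0
  awk -F: '/mpt3sas0-msix/ { gsub(/ /, "", $1); print $1 }' /proc/interrupts |
  while read -r irq; do
          printf '%x' $((1 << cpu)) > "/proc/irq/$irq/smp_affinity"
          cpu=$((cpu + 1))
  done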

Note also that, based on what you're saying above, I don't think this is going
to provide a consistent benefit.  It sounds like what you're suggesting is that
the affinity hint of the mpt3sas driver assigns hints based on what cpu it
expects io requests to come in on (thereby matching completion interrupts with
the submitted io data).  But looking at the code:

a) there doesn't appear to be any assignment like that in mpt3sas; it's just
blindly assigning hints based on the online cpus at the time the device was
registered.  From the mpt3sas driver in _base_assign_reply_queues:

list_for_each_entry(reply_q, &ioc->reply_queue_list, list) {

                unsigned int i, group = nr_cpus / nr_msix;

                if (cpu >= nr_cpus)
                        break;

                if (index < nr_cpus % nr_msix)
                        group++;

                for (i = 0 ; i < group ; i++) {
                        ioc->cpu_msix_table[cpu] = index;
                        cpumask_or(reply_q->affinity_hint,
                                   reply_q->affinity_hint, get_cpu_mask(cpu));
                        cpu = cpumask_next(cpu, cpu_online_mask);
                }

                if (irq_set_affinity_hint(reply_q->vector,
                                           reply_q->affinity_hint))
                        dinitprintk(ioc, pr_info(MPT3SAS_FMT
                            "error setting affinity hint for irq vector %d\n",
                            ioc->name, reply_q->vector));
                index++;
}

and
b) There is no guarantee that, for any given i/o, it will be submitted on a cpu
that is in a mask assigned by the affinity hint.

Neil


>     msix index = 0, irq number = 355, cpu affinity mask = 00000001  hint = 00000001
>     msix index = 1, irq number = 356, cpu affinity mask = 00000002  hint = 00000002
>     msix index = 2, irq number = 357, cpu affinity mask = 00000004  hint = 00000004
>     msix index = 3, irq number = 358, cpu affinity mask = 00000008  hint = 00000008
> 
> ~ Kashyap
> 
> >
> > Neil
> >





