> Ok, your script looks ok (though I'm not sure the kernel you are using
> supports the existence of the driver directory in /proc/irq/<n>, so it may
> not be functioning as expected).

Yes, the kernel exposes /proc/irq/<n>, so the required interface is there.

> > From the attached irqbalance debug output, I can see that irqbalance is
> > able to work as expected with the policy script.
>
> Agreed, it does seem to be functioning properly. Further, I see that those
> interrupts are properly assigned to a single core. Which leads me to
> wonder exactly what is going on here. The debug output seems to
> contradict the affinity mask information you provided earlier.
>
> > > > What is confusing me is - "cpu affinity mask" is just localized to
> > > > Numa Node-0, as PCI device enumeration detected the pci device is
> > > > local to numa_node 0.
> > >
> > > I really dont know what you mean by this. Yes, your masks seem to be
> > > following what could be your numa node layout, but you're assuming
> > > (or it sounds like you're assuming) that irqbalance is doing that
> > > intentionally. Its not, one of the above things is going on.
> > >
> > > > When you say "Driver does not participate in sysfs enumeration" -
> > > > does it mean "numa_node" exposure in sysfs, or anything more than
> > > > that? Sorry for the basics, and thanks for helping me understand.
> > >
> > > I mean, does your driver register itself as a pci device? If so, it
> > > should have a directory in sysfs in /sys/bus/pci/<pci b:d:f>/. As long
> > > as that directory exists and is properly populated, irqbalance should
> > > have everything it needs to properly assign a cpu to all of your irqs.
> >
> > Yes, the driver registers the device as a pci device, and I can see the
> > /sys/bus/pci/devices/ entries for all mpt3sas attached devices.
>
> Agreed, your debug information bears that out.
>
> > > Note that the RHEL6 kernel did not always properly populate that
> > > directory. I added sysfs code to expose the needed irq information in
> > > the kernel, and if you have an older kernel and newer irqbalance, that
> > > might be part of the problem - another reason to contact oracle.
> > >
> > > Another thing you can try is posting the output of irqbalance while
> > > running it with -f and -d. That will give us some insight as to what
> > > its doing (note I'm referring here to upstream irqbalance, not the old
> > > version). And you still didn't answer my question regarding the
> > > policyscript.
> >
> > I have attached the irqbalance debug output (latest from github, last
> > commit 8922ff13704dd0e069c63d46a7bdad89df5f151c) and the policy script.
> >
> > For some reason I have had to move to a different server, which has
> > 4 NUMA sockets.
> > Here is the detail of my setup -
> >
> > [root]# lstopo-no-graphics
> > Machine (64GB)
> >   NUMANode L#0 (P#0 16GB)
> >     Socket L#0 + L3 L#0 (10MB)
> >       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
> >         PU L#0 (P#0)
> >         PU L#1 (P#16)
> >       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
> >         PU L#2 (P#1)
> >         PU L#3 (P#17)
> >       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
> >         PU L#4 (P#2)
> >         PU L#5 (P#18)
> >       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
> >         PU L#6 (P#3)
> >         PU L#7 (P#19)
> >     HostBridge L#0
> >       PCIBridge
> >         PCI 8086:0953
> >       PCIBridge
> >         PCI 1000:0097
> >           Block L#0 "sdb"
> >           Block L#1 "sdc"
> >           Block L#2 "sdd"
> >           Block L#3 "sde"
> >           Block L#4 "sdf"
> >           Block L#5 "sdg"
> >           Block L#6 "sdh"
> >       PCIBridge
> >         PCI 102b:0532
> >       PCI 8086:1d02
> >         Block L#7 "sda"
> >   NUMANode L#1 (P#1 16GB)
> >     Socket L#1 + L3 L#1 (10MB)
> >       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
> >         PU L#8 (P#4)
> >         PU L#9 (P#20)
> >       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
> >         PU L#10 (P#5)
> >         PU L#11 (P#21)
> >       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
> >         PU L#12 (P#6)
> >         PU L#13 (P#22)
> >       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
> >         PU L#14 (P#7)
> >         PU L#15 (P#23)
> >     HostBridge L#4
> >       PCIBridge
> >         PCI 1000:005b
> >   NUMANode L#2 (P#2 16GB) + Socket L#2 + L3 L#2 (10MB)
> >     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
> >       PU L#16 (P#8)
> >       PU L#17 (P#24)
> >     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
> >       PU L#18 (P#9)
> >       PU L#19 (P#25)
> >     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
> >       PU L#20 (P#10)
> >       PU L#21 (P#26)
> >     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
> >       PU L#22 (P#11)
> >       PU L#23 (P#27)
> >   NUMANode L#3 (P#3 16GB)
> >     Socket L#3 + L3 L#3 (10MB)
> >       L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
> >         PU L#24 (P#12)
> >         PU L#25 (P#28)
> >       L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
> >         PU L#26 (P#13)
> >         PU L#27 (P#29)
> >       L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
> >         PU L#28 (P#14)
> >         PU L#29 (P#30)
> >       L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
> >         PU L#30 (P#15)
> >         PU L#31 (P#31)
> >     HostBridge L#6
> >       PCIBridge
> >         PCI 8086:1528
> >           Net L#8 "enp193s0f0"
> >         PCI 8086:1528
> >           Net L#9 "enp193s0f1"
> >       PCIBridge
> >         PCI 8086:0953
> >       PCIBridge
> >
> > Things are a bit clearer now, but what I am seeing here is: with
> > <ignore> and the attached policy hint, the CPU to MSI-X mask is not the
> > same as the logical sequence of CPU #. It is random within the core. I
> > guess it is based on some linked list in irqbalance, which I am not able
> > to understand.
>
> What exactly do you mean by logic sequence of cpu #? Are you under the
> impression that msix vectors should be mapped to their corresponding cpu
> number?

The L# numbers in the above output of lstopo-no-graphics are what I mean by
the logical, sequential CPU numbering. From the driver's point of view we
are looking for parallel IO completion across CPUs, so we always take the
logical CPU sequence into account. What I am trying to highlight, or to
understand, is whether we can somehow populate the same affinity mask for
each CPU as we usually see with the <exact> policy + driver hinting. That
would be one of the best configurations for a high-end storage
controller/device with multiple completion queues.
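To make the end state I am asking about concrete, below is a minimal
user-space sketch of my own (not irqbalance code) of what the <exact> hint
policy effectively produces for these vectors: each irq's smp_affinity is
set to whatever the driver published in affinity_hint. The irq numbers
355-358 are the ones from the masks I quote further down; it has to run as
root, and irqbalance will normally just overwrite these masks again on its
next pass unless the irqs are banned or steered via a policy script.

/*
 * exact_hint.c - copy each irq's affinity_hint into its smp_affinity,
 * which is the end state the <exact> hint policy gives us.
 * Illustration only; irq numbers are the mpt3sas vectors on my setup.
 */
#include <stdio.h>

static void apply_hint(int irq)
{
    char path[64], hint[256];
    FILE *f;

    /* read the mask the driver published via irq_set_affinity_hint() */
    snprintf(path, sizeof(path), "/proc/irq/%d/affinity_hint", irq);
    f = fopen(path, "r");
    if (!f) {
        perror(path);
        return;
    }
    if (!fgets(hint, sizeof(hint), f)) {
        fclose(f);
        return;
    }
    fclose(f);

    /* write the same mask back as the irq's effective affinity */
    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    f = fopen(path, "w");
    if (!f) {
        perror(path);
        return;
    }
    fputs(hint, f);
    fclose(f);

    printf("irq %d: smp_affinity <- %s", irq, hint);
}

int main(void)
{
    int irqs[] = { 355, 356, 357, 358 };  /* mpt3sas MSI-X vectors here */
    unsigned int i;

    for (i = 0; i < sizeof(irqs) / sizeof(irqs[0]); i++)
        apply_hint(irqs[i]);
    return 0;
}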
> That is certainly not the case. irqbalance assigns irqs to cpus by
> estimating how much cpu load each irq is responsible for, and assigning
> that irq to the cpu that is the least loaded at the time of assignment
> (the idea being that cpu load on each cpu is as close to equal as we can
> get it after balancing). If you have some additional need to place irqs
> on cpus such that some higher level task executes on the same cpu as the
> interrupt, it is the responsibility of the higher level task to match the
> irq placement, not vice versa. You can do some trickery to make them
> match up (i.e. do balancing via a policy script that assigns cores based
> on the static configuration of the higher level tasks), but such
> configurations are never going to be standard, and generally you need to
> have those higher level tasks follow the lower level decisions.

We are actually able to solve half of the problem with the additional
--policyscript option when <ignore> is the default hint policy. The half
that is solved is: the driver now always receives the completion back on
the same node that was part of the submission; there is no cross-NUMA-node
submission and completion. All IO generated from Node-x now completes on
Node-x. This is good for a storage HBA, since we are mainly focused on
parallel completion rather than NUMA locality.

The next level of issue can be resolved via kernel-level tuning, but I
wanted to check with an irqbalance expert whether we can manage it within
<irqbalance> or not. Here is that next level of the problem; your input
will help me understand the possibilities.

On my setup Socket 0 has the layout below:

  NUMANode L#0 (P#0 16GB)
    Socket L#0 + L3 L#0 (10MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#16)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#17)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#18)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#19)

For simplicity, assume only node-0 is active in the IO path, and take the
special case where L#0, L#2, L#4 and L#6 are kept busy doing IO submission.
With the latest <ignore> + --policyscript option, I see completion lands on
L#1, L#3, L#5 and L#7. This can cause IO latency: the submitter keeps
sending, which piles up work on the other cpus, unless the completion is
forcefully migrated back to the _exact_ cpu. This is currently tunable in
the kernel via the <rq_affinity> value 2. I am trying to understand whether
this can be done in <irqbalance> itself, to avoid the <rq_affinity> setting
in the kernel; that way we only need to tune one component.

From your reply below I understood that the assignment of a cpu to an msix
vector within a numa node / balance-level core is based on cpu work load
(coming from /proc/interrupts) at the time of assignment:

  "That is certainly not the case. irqbalance assigns irqs to cpus by
  estimating how much cpu load each irq is responsible for, and assigning
  that irq to the cpu that is the least loaded at the time of assignment"

Is it possible, and would it be useful, to bypass that policy and provide a
<key/value> option? Once that option is used, <irqbalance> would keep
assigning the driver's irq numbers in sequential order, not based on
interrupt load.

Thanks for trying to digest my queries and providing all your technical
inputs.
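One more illustration of why the exact cpu inside the node matters here,
and not just the node: the driver picks its reply queue (msix index) from
the submitting cpu via the cpu_msix_table it builds (see the driver code
quoted further down), so the completion only lands back on the submitter
when vector n stays affined to the n-th logical cpu of its group. The toy
program below is just my own model of that mapping, not the actual mpt3sas
code; NR_CPUS and NR_MSIX are example values roughly matching node 0 of my
setup.

/*
 * reply_queue_map.c - toy model of how a submitting cpu chooses its
 * completion (msix) vector, following the round-robin grouping idea in
 * _base_assign_reply_queues() quoted below.
 */
#include <stdio.h>

#define NR_CPUS  8   /* logical cpus used for submission (example) */
#define NR_MSIX  4   /* completion queues / msix vectors (example) */

int main(void)
{
    int cpu_msix_table[NR_CPUS];
    int cpu = 0, index, i;

    /* split the online cpus evenly across the msix vectors, in
     * logical cpu order, exactly like the driver's grouping loop */
    for (index = 0; index < NR_MSIX; index++) {
        int group = NR_CPUS / NR_MSIX;

        if (index < NR_CPUS % NR_MSIX)
            group++;
        for (i = 0; i < group && cpu < NR_CPUS; i++, cpu++)
            cpu_msix_table[cpu] = index;
    }

    /*
     * At submission time the driver conceptually does
     *     msix = cpu_msix_table[smp_processor_id()];
     * so the completion interrupt comes back to the submitting cpu only
     * if irqbalance (or a policy script) keeps that vector affined to a
     * cpu in the same group - ideally the submitting cpu itself.
     */
    for (cpu = 0; cpu < NR_CPUS; cpu++)
        printf("submit on CPU %d -> completion via msix index %d\n",
               cpu, cpu_msix_table[cpu]);
    return 0;
}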
> > E.g. the piece of code below hints that, per core, irq numbers are
> > stored in the linked list "d->interrupts". This list is not based on a
> > sequential traverse, but rather on how the interrupts are generated.
> > Right?
>
> As noted above, its a measurement of approximate cpu load generated by
> each interrupt. As far as assignment goes, there is no relevance to which
> cpu handles an interrupt within a numa node from the standpoint of the
> interrupt itself. If you have some requirement to align interrupts with
> other execution contexts, then you either need to write your policy
> script to understand those higher level configurations, or you need to
> modify your higher level setup to follow what irqbalance does.
>
> > static void dump_cache_domain(struct topo_obj *d, void *data)
> > {
> >     char *buffer = data;
> >     cpumask_scnprintf(buffer, 4095, d->mask);
> >     log(TO_CONSOLE, LOG_INFO,
> >         "%s%sCache domain %i: numa_node is %d cpu mask is %s (load %lu) \n",
> >         log_indent, log_indent,
> >         d->number, cache_domain_numa_node(d)->number, buffer,
> >         (unsigned long)d->load);
> >     if (d->children)
> >         for_each_object(d->children, dump_balance_obj, NULL);
> >     if (g_list_length(d->interrupts) > 0)
> >         for_each_irq(d->interrupts, dump_irq, (void *)10);
> > }
> >
> > I sometimes see different cpu masks, as in the snippets below. The cpu
> > mask on my setup varies from run to run, but the good thing is that the
> > mask stays within the <core>; it is just not like <exact>.
>
> They're not going to be the same, I'm not sure why that is so hard to
> understand. All affinity_hint is is a driver's best guess as to where to
> put an irq, and its, as a rule, sub-optimal. Expecting irqbalance to
> arrive at the same decision as affinity_hint independently is improper
> reasoning.
>
> > msix index = 0, irq number = 355, cpu affinity mask = 00000008  hint = 00000001
> > msix index = 1, irq number = 356, cpu affinity mask = 00000004  hint = 00000002
> > msix index = 2, irq number = 357, cpu affinity mask = 00000002  hint = 00000004
> > msix index = 3, irq number = 358, cpu affinity mask = 00000001  hint = 00000008
> >
> > msix index = 0, irq number = 355, cpu affinity mask = 00000002  hint = 00000001
> > msix index = 1, irq number = 356, cpu affinity mask = 00000008  hint = 00000002
> > msix index = 2, irq number = 357, cpu affinity mask = 00000004  hint = 00000004
> > msix index = 3, irq number = 358, cpu affinity mask = 00000001  hint = 00000008
> >
> > I am expecting the layout below, because once the <mpt3sas> driver sends
> > an IO it hints the FW about the completion queue. E.g. if an IO is
> > submitted from logical CPU #X, the driver uses smp_processor_id() to get
> > that logical CPU #X and expects the completion on the same CPU for
> > better performance. Is this expectation possible with the existing
> > latest <irqbalance>?
>
> It is, but only through the use of a policy script that you write to
> guarantee that assignment. With the information that irqbalance has at
> hand, there is no need to balance irqs in the below manner. Even if there
> is some upper layer indicator that suggests it might be beneficial,
> theres nothing that irqbalance can use to consistently determine that.
> You can certainly do it with a policy script if you like in any number of
> ways, but any logic you put in that script is by and large going to
> remain there (i.e. its not going to become part of the default
> behavior), unless you can provide:
>
> a) a mechanism to consistently query devices to determine if they should
>    be balanced in this manner
>
> b) evidence that for all devices of that class, this provides a
>    performance benefit.
>
> In other words, you have to show me which irqs this is a benefit for, and
> how to tell if an arbitrary irq fits that profile.
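Understood. Just to check that I am reading the policy-script interface
correctly, a minimal policy helper along the lines below is the kind of
thing I would write. This is only a sketch: I am assuming the conventions
from the irqbalance(1) man page (the script is invoked per irq with the
sysfs device path and the irq number as arguments, and returns key=value
pairs on stdout), and the PCI address and node number are placeholders for
my HBA. A shell script would do the same job; a compiled program works too,
since irqbalance just executes whatever --policyscript points at.

/*
 * mpt3sas_policy.c - sketch of a --policyscript helper (assumptions as
 * noted above).  For irqs belonging to my HBA it asks irqbalance to keep
 * balancing at core level inside the HBA's local numa node; every other
 * irq is left to the defaults by printing nothing.
 */
#include <stdio.h>
#include <string.h>

#define HBA_PCI_ADDR  "0000:02:00.0"  /* placeholder PCI b:d:f of the HBA */
#define HBA_NUMA_NODE 0               /* placeholder local node of the HBA */

int main(int argc, char **argv)
{
    /* argv[1] = sysfs device path, argv[2] = irq number (assumed) */
    const char *devpath = (argc > 1) ? argv[1] : "";

    if (strstr(devpath, HBA_PCI_ADDR)) {
        printf("balance_level=core\n");
        printf("numa_node=%d\n", HBA_NUMA_NODE);
    }
    return 0;
}

If newer irqbalance builds also accept a per-irq hintpolicy key from the
policy script, the same helper could request exact hint following for just
these vectors; I have not verified that against the version I am running,
though.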
> Note also that, based on what you're saying above, I don't think this is
> going to provide a consistent benefit. It sounds like what you're
> suggesting is that the affinity hint of the mpt3sas driver assigns hints
> based on what cpu it expects io/requests to come in on (thereby matching
> completion interrupts with the submitted io data). But looking at the
> code:
>
> a) there doesn't appear to be any assignment like that in mpt3sas, its
>    just blindly assigning hints based on the online cpus at the time the
>    device was registered. From the mpt3sas driver in
>    _base_assign_reply_queues():
>
>     list_for_each_entry(reply_q, &ioc->reply_queue_list, list) {
>         unsigned int i, group = nr_cpus / nr_msix;
>
>         if (cpu >= nr_cpus)
>             break;
>
>         if (index < nr_cpus % nr_msix)
>             group++;
>
>         for (i = 0 ; i < group ; i++) {
>             ioc->cpu_msix_table[cpu] = index;
>             cpumask_or(reply_q->affinity_hint,
>                 reply_q->affinity_hint, get_cpu_mask(cpu));
>             cpu = cpumask_next(cpu, cpu_online_mask);
>         }
>
>         if (irq_set_affinity_hint(reply_q->vector,
>                 reply_q->affinity_hint))
>             dinitprintk(ioc, pr_info(MPT3SAS_FMT
>                 "error setting affinity hint for irq vector %d\n",
>                 ioc->name, reply_q->vector));
>         index++;
>     }
>
> and
>
> b) There is no guarantee that, for any given i/o, it will be submitted on
>    a cpu that is in a mask assigned by the affinity hint.

Half of the logic to meet this requirement is in the HBA firmware: it knows
which msix index is to be used for the completion. So the simple equation
is to map all logical CPUs onto the number of completion queues in the
hardware, and the cpu-to-msix affinity needs the same interpretation at the
same time. That was working as expected with the <exact> policy, and we are
trying to achieve the same using <ignore> plus --policyscript.

> Neil
>
> > msix index = 0, irq number = 355, cpu affinity mask = 00000001  hint = 00000001
> > msix index = 1, irq number = 356, cpu affinity mask = 00000002  hint = 00000002
> > msix index = 2, irq number = 357, cpu affinity mask = 00000004  hint = 00000004
> > msix index = 3, irq number = 358, cpu affinity mask = 00000008  hint = 00000008
> >
> > ~ Kashyap
> >
> > > Neil