irqbalance subset policy and CPU lockup on storage controller.

On Wed, Oct 14, 2015 at 01:33:48PM +0530, Kashyap Desai wrote:
> >
> > Ok, your script looks ok (though I'm not sure the kernel you are using
> > supports the existence of the driver directory in /proc/irq/<n>, so it
> > may not be functioning as expected).
> 
> Yes, the kernel exposes /proc/irq/<n>, so it has the required interface.
> 
No, not /proc/irq/<n>, but /proc/irq/<n>/<driver name>.  The former has always
been there; the latter is fairly recent.
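
As a quick sanity check (just a sketch; I'm assuming the vectors were
registered with a name containing "mpt3sas" - adjust to whatever your
/proc/interrupts actually shows):

    # List any per-driver directories under /proc/irq/<n>/; if nothing
    # prints, the running kernel does not expose them and a policy
    # script's driver-name matching can't work.
    ls -d /proc/irq/[0-9]*/*mpt3sas* 2>/dev/null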

 
> >
> > > From the attached irqbalance debug output, I can see that irqbalance is
> > > able to work as expected with the policy script.
> > >
> > Agreed, it does seem to be functioning properly.  Further, I see that
> > those interrupts are properly assigned to a single core, which leads me to
> > wonder exactly what is going on here.  The debug output seems to
> > contradict the affinity mask information you provided earlier.
> >
> > > >
> > > > > What is confusing me is that the "cpu affinity mask" is localized to
> > > > > NUMA node 0, since PCI enumeration detected that the device is local
> > > > > to numa_node 0.
> > > > I really don't know what you mean by this.  Yes, your masks seem to
> > > > be following what could be your numa node layout, but you're assuming
> > > > (or it sounds like you're assuming) that irqbalance is doing that
> > > > intentionally.  It's not; one of the above things is going on.
> > > >
> > > > >
> > > > >
> > > > > When you say "the driver does not participate in sysfs enumeration",
> > > > > does that mean "numa_node" exposure in sysfs, or something more than
> > > > > that?  Sorry for the basic questions, and thanks for helping me
> > > > > understand.
> > > > >
> > > > I mean, does your driver register itself as a pci device?  If so, it
> > > > should have a directory in sysfs in /sys/bus/pci/<pci b:d:f>/.  As long
> > > > as that directory exists and is properly populated, irqbalance should
> > > > have everything it needs to properly assign a cpu to all of your irqs.
> > >
> > > Yes, the driver registers the device as a pci device, and I can see all
> > > the /sys/bus/pci/devices/ entries for the mpt3sas-attached device.
> > >
> > Agreed, your debug information bears that out.
> >
> > > > Note that the RHEL6 kernel did not always properly populate that
> > > > directory.  I added sysfs code to the kernel to expose the needed irq
> > > > information, and if you have an older kernel and a newer irqbalance,
> > > > that might be part of the problem - another reason to contact Oracle.
> > > >
> > > >
> > > > Another thing you can try is posting the output of irqbalance while
> > > > running it with -f and -d.  That will give us some insight as to what
> > > > it's doing (note I'm referring here to upstream irqbalance, not the old
> > > > version).  And you still didn't answer my question regarding the
> > > > policyscript.
> > >
> > > I have attached the irqbalance debug output (latest from GitHub, last
> > > commit 8922ff13704dd0e069c63d46a7bdad89df5f151c) and the policy script.
> > >
> > > For unrelated reasons, I had to move to a different server, which has 4
> > > NUMA sockets.
> > >
> > > Here are the details of my setup:
> > >
> > > [root]# lstopo-no-graphics
> > > Machine (64GB)
> > >   NUMANode L#0 (P#0 16GB)
> > >     Socket L#0 + L3 L#0 (10MB)
> > >       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
> > >         PU L#0 (P#0)
> > >         PU L#1 (P#16)
> > >       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
> > >         PU L#2 (P#1)
> > >         PU L#3 (P#17)
> > >       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
> > >         PU L#4 (P#2)
> > >         PU L#5 (P#18)
> > >       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
> > >         PU L#6 (P#3)
> > >         PU L#7 (P#19)
> > >     HostBridge L#0
> > >       PCIBridge
> > >         PCI 8086:0953
> > >       PCIBridge
> > >         PCI 1000:0097
> > >           Block L#0 "sdb"
> > >           Block L#1 "sdc"
> > >           Block L#2 "sdd"
> > >           Block L#3 "sde"
> > >           Block L#4 "sdf"
> > >           Block L#5 "sdg"
> > >           Block L#6 "sdh"
> > >       PCIBridge
> > >         PCI 102b:0532
> > >       PCI 8086:1d02
> > >         Block L#7 "sda"
> > >   NUMANode L#1 (P#1 16GB)
> > >     Socket L#1 + L3 L#1 (10MB)
> > >       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
> > >         PU L#8 (P#4)
> > >         PU L#9 (P#20)
> > >       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
> > >         PU L#10 (P#5)
> > >         PU L#11 (P#21)
> > >       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
> > >         PU L#12 (P#6)
> > >         PU L#13 (P#22)
> > >       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
> > >         PU L#14 (P#7)
> > >         PU L#15 (P#23)
> > >     HostBridge L#4
> > >       PCIBridge
> > >         PCI 1000:005b
> > >   NUMANode L#2 (P#2 16GB) + Socket L#2 + L3 L#2 (10MB)
> > >     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
> > >       PU L#16 (P#8)
> > >       PU L#17 (P#24)
> > >     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
> > >       PU L#18 (P#9)
> > >       PU L#19 (P#25)
> > >     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
> > >       PU L#20 (P#10)
> > >       PU L#21 (P#26)
> > >     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
> > >       PU L#22 (P#11)
> > >       PU L#23 (P#27)
> > >   NUMANode L#3 (P#3 16GB)
> > >     Socket L#3 + L3 L#3 (10MB)
> > >       L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
> > >         PU L#24 (P#12)
> > >         PU L#25 (P#28)
> > >       L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
> > >         PU L#26 (P#13)
> > >         PU L#27 (P#29)
> > >       L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
> > >         PU L#28 (P#14)
> > >         PU L#29 (P#30)
> > >       L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
> > >         PU L#30 (P#15)
> > >         PU L#31 (P#31)
> > >     HostBridge L#6
> > >       PCIBridge
> > >         PCI 8086:1528
> > >           Net L#8 "enp193s0f0"
> > >         PCI 8086:1528
> > >           Net L#9 "enp193s0f1"
> > >       PCIBridge
> > >         PCI 8086:0953
> > >       PCIBridge
> > >
> > >
> > > Things are a bit clearer now, but what I am seeing here is that with
> > > <ignore> and the attached policy hint, the CPU-to-MSI-X mask does not
> > > follow the logical sequence of CPU numbers.  It is random within the
> > > core set.  I guess it is based on some linked list in irqbalance which I
> > > am not able to understand.
> >
> > What exactly do you mean by the logical sequence of cpu numbers?  Are you
> > under the impression that msix vectors should be mapped to their
> > corresponding cpu number?
> 
> The L# numbering in the lstopo-no-graphics output above is what I mean by
> the logical CPU sequence.  From the driver's point of view we are looking
> for parallel I/O completion across CPUs, so we always take the logical CPU
> sequence into account.  What I am trying to highlight, or understand, is
> whether we can somehow populate the same affinity mask for each CPU as we
> usually see with the <exact> policy plus driver hinting.  That would be one
> of the best configurations for a high-end storage controller/device with
> multiple completion queues.
> 
As I said, you can, but irqbalance will not do it on its own because, while you
assert that it's best for performance, that's not actually the case (or rather a
misguided interpretation of the case).  The reason it's beneficial is that
higher layer software (in this case the block io completion handler) is
interrogating block io requests to ensure that completions occur on the same cpu
that the i/o was submitted on.  Irqbalance has no knowledge of that, nor will
it.  You can certainly write a policyscript that will implement this (in fact it
sounds like, given the hinting that the driver is applying, you can just set
hint_policy=exact for those specific interrupts), and be done with it.
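
For illustration only, a minimal policyscript sketch along those lines might
look like the following.  I'm assuming the controller's vectors carry
"mpt3sas" in their name in /proc/interrupts; irqbalance passes the script the
sysfs device path and the irq number, and reads key=value pairs back from
stdout:

    #!/bin/sh
    # $1 = sysfs device path, $2 = irq number (as passed by irqbalance)
    IRQ=$2
    # Follow the driver's affinity hint only for this controller's vectors;
    # every other irq keeps the global policy.
    if grep -q "^ *${IRQ}:.*mpt3sas" /proc/interrupts; then
        echo "hint_policy=exact"
    fi
    exit 0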

But it's important to note that, even in that case, it won't always work.
That's because the driver is hinting that irq vector 1 should affine to cpu1, 2
to 2, and so on, the underlying assumption being that irq vector 2 is what
triggers for i/o completion when a request submitted from cpu2 is done.  That
need not be the case (many network cards with storage functions integrated
don't adhere to that at all).  You are welcome to write a policy script to do
what you are trying to do here (or simply assign exact hint policies globally
if you like), but I hope you can understand why that's not going to become the
default.  What you observe as your desired behavior is far from a universal
constant.


> > That is certainly not the case.  irqbalance assigns irqs to cpus by
> > estimating how much cpu load each irq is responsible for, and assigning
> > that irq to the cpu that is the least loaded at the time of assignment
> > (the idea being that cpu load on each cpu is as close to equal as we can
> > get it after balancing).  If you have some additional need to place irqs
> > on cpus such that some higher level task executes on the same cpu as the
> > interrupt, it is the responsibility of the higher level task to match the
> > irq placement, not vice versa.  You can do some trickery to make them
> > match up (i.e. do balancing via a policy script that assigns cores based
> > on the static configuration of the higher level tasks), but such
> > configurations are never going to be standard, and generally you need to
> > have those higher level tasks follow the lower level decisions.
> 
> We are actually able to solve half of the problem with the additional
> --policyscript option when <ignore> is the default policy.  By half the
> problem I mean that the driver now always receives completions back on the
> same node that was part of the submission; there is no cross-NUMA-node
> submission and completion.  All I/O generated from node X now completes on
> node X.  This is good for a storage HBA, as we are mainly focused on
> parallel completion and not on NUMA locality.
> 
> The next level of issue can be resolved via kernel-level tuning, but I
> thought I would ask the irqbalance experts whether we can manage it with
> irqbalance or not.  Here is the next level of the problem; your input will
> help me understand the possibilities.
> 
> On my setup Socket 0 has below layout.
> 
> NUMANode L#0 (P#0 16GB)
> 	Socket L#0 + L3 L#0 (10MB)
> 		L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
> 			PU L#0 (P#0)
> 			PU L#1 (P#16)
> 		L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
> 			PU L#2 (P#1)
> 			PU L#3 (P#17)
> 		L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
> 			PU L#4 (P#2)
> 			PU L#5 (P#18)
> 		L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
> 			PU L#6 (P#3)
> 			PU L#7 (P#19)
> 
> For simplicity, assume only node 0 is active in the IO path, and take the
> special case where L#0, L#2, L#4 and L#6 are kept busy doing IO submission.
> With the latest <ignore> + --policyscript option, I see completion lands on
> L#1, L#3, L#5 and L#7.  This can cause IO latency, since the submitter keeps
> sending and work piles up on the other CPUs unless completion is forcefully
> migrated to the _exact_ submitting cpu.  This is currently tunable in the
> kernel via the <rq_affinity> value 2.
> 
> I am trying to understand whether this can be done in <irqbalance> itself,
> to avoid the <rq_affinity> setting in the kernel.  That way we would only
> need to tune one component.
> 
Sort of.  If you don't touch rq_affinity, and simply run irqbalance such that
each vector gets a unique cpu, then the completion will execute on the cpu that
triggered the interrupt.  It won't necessarily be the cpu that submitted the
request, but I think that's actually far less important in terms of performance.
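
For reference, the block layer knob you mention is per queue; sdb here is just
one of the devices from your lstopo output:

    # 0 = no forced affinity, 1 = complete within the submitting cpu's group,
    # 2 = force completion onto the exact submitting cpu
    cat /sys/block/sdb/queue/rq_affinity
    echo 2 > /sys/block/sdb/queue/rq_affinity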

> From your reply below I understood that the assignment of cpus to msix
> vectors within a numa node (balance level core) is based on cpu workload
> (derived from /proc/interrupts) at the time of assignment:
> 
> "That is certainly not the case. irqbalance assigns irqs to cpus by
> estimating how much cpu load each irq is responsible for, and assigning
> that irq to the cpu that is the least loaded at the time of assignment"
> 
> Is it possible, and would it be useful, to bypass that policy via a
> <key/value> option?  Once that option is used, <irqbalance> would keep
> assigning the driver's irq numbers in sequential order rather than based on
> interrupt load.
> 
It is possible; that's what the policyscript option is for.  But as noted
above, a blind sequential assignment isn't a universal benefit, and so not
something I'm going to make the default for irqbalance.  But sure, add a
policyscript, and you can assign irqs in any way that you feel is best for your
specific workload.
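
If you do go that route, a rough, untested sketch of the "blind sequential"
assignment done outside of irqbalance might look like this (the "mpt3sas"
match and the plain 0,1,2,... ordering are assumptions for illustration; it
also assumes irqbalance is stopped or bans these irqs so it doesn't rewrite
the masks afterwards):

    #!/bin/sh
    # Pin this controller's vectors to cpus 0,1,2,... in the order they
    # appear in /proc/interrupts.  The single hex mask form is fine for
    # systems with up to 63 cpus, as in your lstopo output.
    cpu=0
    for irq in $(awk '/mpt3sas/ { sub(":", "", $1); print $1 }' /proc/interrupts); do
        printf "%x" $((1 << cpu)) > /proc/irq/$irq/smp_affinity
        cpu=$((cpu + 1))
    done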

Neil
