> -----Original Message-----
> From: Neil Horman [mailto:nhorman at tuxdriver.com]
> Sent: Wednesday, October 14, 2015 5:03 PM
> To: Kashyap Desai
> Cc: Neil Horman; irqbalance at lists.infradead.org; Peter Rivera; Mike Roy;
> Sreekanth Reddy
> Subject: Re: irqbalancer subset policy and CPU lock up on storage
> controller.
>
> On Wed, Oct 14, 2015 at 01:33:48PM +0530, Kashyap Desai wrote:
> > >
> > > Ok, your script looks ok (though I'm not sure the kernel you are
> > > using supports the existence of the driver directory in
> > > /proc/irq/<n>, so it may not be functioning as expected).
> >
> > Yes, the kernel exposes /proc/irq/<n>, so the kernel has the required
> > interface.
> >
> No, not /proc/irq/<n>, /proc/irq/<n>/<driver name>. The former has
> always been there, the latter is fairly recent.

Yes, I see /proc/irq/<n>/<irq name>. Most drivers use the driver name
string in the irq name at the time of request_irq(). In my case it is
"/proc/irq/355/mpt3sas0-msix0". I think this is what you mean.

> > > > >
> > > > > From the attached irqbalance debug output, I can see that
> > > > > irqbalance is able to work as expected with the policy script.
> > > > >
> > > Agreed, it does seem to be functioning properly. Further, I see that
> > > those interrupts are properly assigned to a single core. Which leads
> > > me to wonder exactly what is going on here. The debug output seems
> > > to contradict the affinity mask information you provided earlier.
> > >
> > > > > > What is confusing me is that the "cpu affinity mask" is just
> > > > > > localized to NUMA Node-0, as PCI device enumeration detected
> > > > > > that the PCI device is local to numa_node 0.
> > > > > I really don't know what you mean by this. Yes, your masks seem
> > > > > to be following what could be your NUMA node layout, but you're
> > > > > assuming (or it sounds like you're assuming) that irqbalance is
> > > > > doing that intentionally. It's not; one of the above things is
> > > > > going on.
> > > > >
> > > > > > When you say "Driver does not participate in sysfs
> > > > > > enumeration" - does it mean "numa_node" exposure in sysfs or
> > > > > > anything more than that? Sorry for the basics, and thanks for
> > > > > > helping me understand things.
> > > > > I mean, does your driver register itself as a pci device? If
> > > > > so, it should have a directory in sysfs in
> > > > > /sys/bus/pci/<pci b:d:f>/. As long as that directory exists and
> > > > > is properly populated, irqbalance should have everything it
> > > > > needs to properly assign a cpu to all of your irqs.
> > > >
> > > > Yes, the driver registers the device as a PCI device and I can see
> > > > all the /sys/bus/pci/devices/ entries for the mpt3sas attached
> > > > device.
> > > >
> > > Agreed, your debug information bears that out.
> > >
> > > > > Note that the RHEL6 kernel did not always properly populate that
> > > > > directory. I added sysfs code to expose the needed irq
> > > > > information in the kernel, and if you have an older kernel and a
> > > > > newer irqbalance, that might be part of the problem - another
> > > > > reason to contact Oracle.
> > > > >
> > > > > Another thing you can try is posting the output of irqbalance
> > > > > while running it with -f and -d. That will give us some insight
> > > > > as to what it's doing (note I'm referring here to upstream
> > > > > irqbalance, not the old version).
> > > > > And you still didn't answer my question regarding the
> > > > > policyscript.
> > > >
> > > > I have attached the irqbalance debug output (latest from github,
> > > > last commit 8922ff13704dd0e069c63d46a7bdad89df5f151c) and the
> > > > policy script.
> > > >
> > > > For some reason I had to move to a different server, which has 4
> > > > NUMA sockets.
> > > >
> > > > Here are the details of my setup -
> > > >
> > > > [root]# lstopo-no-graphics
> > > > Machine (64GB)
> > > >   NUMANode L#0 (P#0 16GB)
> > > >     Socket L#0 + L3 L#0 (10MB)
> > > >       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
> > > >         PU L#0 (P#0)
> > > >         PU L#1 (P#16)
> > > >       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
> > > >         PU L#2 (P#1)
> > > >         PU L#3 (P#17)
> > > >       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
> > > >         PU L#4 (P#2)
> > > >         PU L#5 (P#18)
> > > >       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
> > > >         PU L#6 (P#3)
> > > >         PU L#7 (P#19)
> > > >     HostBridge L#0
> > > >       PCIBridge
> > > >         PCI 8086:0953
> > > >       PCIBridge
> > > >         PCI 1000:0097
> > > >           Block L#0 "sdb"
> > > >           Block L#1 "sdc"
> > > >           Block L#2 "sdd"
> > > >           Block L#3 "sde"
> > > >           Block L#4 "sdf"
> > > >           Block L#5 "sdg"
> > > >           Block L#6 "sdh"
> > > >       PCIBridge
> > > >         PCI 102b:0532
> > > >       PCI 8086:1d02
> > > >         Block L#7 "sda"
> > > >   NUMANode L#1 (P#1 16GB)
> > > >     Socket L#1 + L3 L#1 (10MB)
> > > >       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
> > > >         PU L#8 (P#4)
> > > >         PU L#9 (P#20)
> > > >       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
> > > >         PU L#10 (P#5)
> > > >         PU L#11 (P#21)
> > > >       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
> > > >         PU L#12 (P#6)
> > > >         PU L#13 (P#22)
> > > >       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
> > > >         PU L#14 (P#7)
> > > >         PU L#15 (P#23)
> > > >     HostBridge L#4
> > > >       PCIBridge
> > > >         PCI 1000:005b
> > > >   NUMANode L#2 (P#2 16GB) + Socket L#2 + L3 L#2 (10MB)
> > > >     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
> > > >       PU L#16 (P#8)
> > > >       PU L#17 (P#24)
> > > >     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
> > > >       PU L#18 (P#9)
> > > >       PU L#19 (P#25)
> > > >     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
> > > >       PU L#20 (P#10)
> > > >       PU L#21 (P#26)
> > > >     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
> > > >       PU L#22 (P#11)
> > > >       PU L#23 (P#27)
> > > >   NUMANode L#3 (P#3 16GB)
> > > >     Socket L#3 + L3 L#3 (10MB)
> > > >       L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
> > > >         PU L#24 (P#12)
> > > >         PU L#25 (P#28)
> > > >       L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
> > > >         PU L#26 (P#13)
> > > >         PU L#27 (P#29)
> > > >       L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
> > > >         PU L#28 (P#14)
> > > >         PU L#29 (P#30)
> > > >       L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
> > > >         PU L#30 (P#15)
> > > >         PU L#31 (P#31)
> > > >     HostBridge L#6
> > > >       PCIBridge
> > > >         PCI 8086:1528
> > > >           Net L#8 "enp193s0f0"
> > > >         PCI 8086:1528
> > > >           Net L#9 "enp193s0f1"
> > > >       PCIBridge
> > > >         PCI 8086:0953
> > > >       PCIBridge
> > > >
> > > > Things are a bit clearer now, but what I am seeing here is: with
> > > > <ignore> and the attached policy hint, the CPU to MSI-X mask is
> > > > not the same as the logical sequence of CPU #. It is random within
> > > > the core. I guess it is based on some linked list in irqbalance,
> > > > which I am not able to understand.
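As a side note for anyone reproducing this: the MSI-X-to-CPU mapping I am
describing can be dumped with a quick loop over the standard /proc
interfaces. This is only a sketch; it assumes the vectors show up in
/proc/interrupts under the "mpt3sas" name the driver registers via
request_irq():

  # Print each mpt3sas vector together with the CPUs it is currently
  # allowed to fire on (smp_affinity_list is the human-readable form of
  # the affinity mask).
  for irq in $(awk -F: '/mpt3sas/ {print $1}' /proc/interrupts); do
      echo "irq ${irq}: cpus $(cat /proc/irq/${irq}/smp_affinity_list)"
  done

Comparing that output against the lstopo output above is how I am judging
whether the assignment follows the logical CPU sequence.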
> > > What exactly do you mean by logic sequence of cpu #? Are you under
> > > the impression that msix vectors should be mapped to their
> > > corresponding cpu number?
> >
> > In the above output of lstopo-no-graphics, L# is what I mean by the
> > logical CPU sequence. From the driver's point of view we are looking
> > for parallel IO completion on CPUs, so we always take the logical CPU
> > sequence into account.
> > What I am trying to highlight or understand is: can we somehow
> > populate the same affinity mask for each CPU as we usually see with
> > the <exact> policy + driver hinting? That would be one of the best
> > configurations for a high-end storage controller/device with multiple
> > completion queues.
> >
> As I said, you can, but irqbalance will not do it on its own, because,
> while you assert that it's best for performance, that's not actually
> the case (or rather a misguided interpretation of the case). The reason
> it's beneficial is because higher layer software (in this case the
> block io completion handler) is interrogating block io requests to
> ensure that completions occur on the same cpu that the i/o was
> submitted on. Irqbalance has no knowledge of that, nor will it. You can
> certainly write a policyscript that will implement this (in fact it
> sounds like, given the hinting that the driver is applying, you can
> just set hint_policy=exact for those specific interrupts), and be done
> with it.

I did not notice this option. It's my bad that I did not install
irqbalance and instead worked out of a local directory, so I was looking
at an old irqbalance man page. What you explained above is exactly what I
was looking for. Fine-grained tuning per IRQ with hintpolicy=exact is a
good default for the mpt3sas driver. I modified the script as below and
got the expected results (a fuller sketch of the whole policy script
follows further down):

  echo "numa_node=${node}"
  echo "hintpolicy=exact"

> But it's important to note that, even in that case, it won't always
> work. That's because the driver is hinting that irq vector 1 should
> affine to cpu1, 2 to 2, and so on, the underlying assumption being that
> irq vector 2 is what triggers for i/o completion when a request
> submitted from cpu2 is done. That need not be the case (many network
> cards with storage functions integrated don't adhere to that at all).
> You are welcome to write a policy script to do what you are trying to
> do here (or simply assign exact hint policies globally if you like),
> but you can understand here I hope why that's not going to become the
> default. What you observe as your desired behavior is far from a
> universal constant.
>
> > > That is certainly not the case. irqbalance assigns irqs to cpus by
> > > estimating how much cpu load each irq is responsible for, and
> > > assigning that irq to the cpu that is the least loaded at the time
> > > of assignment (the idea being that cpu load on each cpu is as close
> > > to equal as we can get it after balancing). If you have some
> > > additional need to place irqs on cpus such that some higher level
> > > task executes on the same cpu as the interrupt, it is the
> > > responsibility of the higher level task to match the irq placement,
> > > not vice versa. You can do some trickery to make them match up
> > > (i.e. do balancing via a policy script that assigns cores based on
> > > the static configuration of the higher level tasks), but such
> > > configurations are never going to be standard, and generally you
> > > need to have those higher level tasks follow the lower level
> > > decisions.
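For completeness, here is roughly what the modified policy script looks
like on my setup. This is only a sketch under a couple of assumptions:
that irqbalance invokes the policy script with the sysfs device path and
the IRQ number as its two arguments, and that the mpt3sas vectors can be
recognised by the mpt3sas* entries under /proc/irq/<n>/ (as in the
/proc/irq/355/mpt3sas0-msix0 example above).

  #!/bin/bash
  # Sketch of a per-IRQ policy script for irqbalance --policyscript.
  # irqbalance runs it once per IRQ and parses key=value pairs on stdout.
  device_path=$1    # sysfs path of the device owning this IRQ
  irq=$2            # IRQ number

  # Only steer the mpt3sas MSI-X vectors; everything else falls back to
  # the normal balancing behaviour.
  if ls "/proc/irq/${irq}/" 2>/dev/null | grep -q '^mpt3sas'; then
      # Keep the vector on the adapter's local NUMA node ...
      node=$(cat "${device_path}/numa_node" 2>/dev/null)
      if [ -n "${node}" ] && [ "${node}" -ge 0 ]; then
          echo "numa_node=${node}"
      fi
      # ... and honour the per-vector affinity hint the driver registers.
      echo "hintpolicy=exact"
  fi

For testing I run it in the foreground with something like
"irqbalance -f -d --policyscript=/etc/sysconfig/mpt3sas-policy.sh" (the
script path here is just an example; use wherever you saved it).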
> > We are actually able to solve half of the problem with the additional
> > --policyscript option if <ignore> is the default policy. The half
> > that is solved is: "the driver now always receives the completion
> > back on the same node that was part of the submission; there is no
> > cross-NUMA-node submission and completion."
> > Now all IO generated from Node-x always completes on Node-x. This is
> > good for a storage HBA, as we are mainly focusing on parallel
> > completion and not on NUMA locality.
> >
> > The next level of issue can be resolved via kernel-level tuning, but
> > I thought I would check with an irqbalance expert whether we can
> > manage it with <irqbalance> or not.
> > Here is the next level of the problem. Your input will help me
> > understand the possibilities.
> >
> > On my setup, Socket 0 has the layout below.
> >
> > NUMANode L#0 (P#0 16GB)
> >   Socket L#0 + L3 L#0 (10MB)
> >     L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
> >       PU L#0 (P#0)
> >       PU L#1 (P#16)
> >     L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
> >       PU L#2 (P#1)
> >       PU L#3 (P#17)
> >     L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
> >       PU L#4 (P#2)
> >       PU L#5 (P#18)
> >     L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
> >       PU L#6 (P#3)
> >       PU L#7 (P#19)
> >
> > For simplicity, assume only node-0 is active in the IO path, and
> > consider the special case where L#0, L#2, L#4 and L#6 are kept busy
> > doing IO submission. With the latest <ignore> + --policyscript
> > option, I see completion is on L#1, L#3, L#5 and L#7. This can cause
> > IO latency, as the submitter keeps sending, which piles up work on
> > the other CPUs unless the completion is forcefully migrated to the
> > _exact_ CPU. This is currently tunable in the kernel via the
> > <rq_affinity> value 2.
> >
> > I am trying to understand if this can be done in <irqbalance> itself,
> > to avoid the <rq_affinity> setting in the kernel. That way we would
> > only need to tune one component.
> >
> Sort of. If you don't touch rq_affinity, and simply run irqbalance such
> that each vector gets a unique cpu, then the completion will execute on
> the cpu that triggered the interrupt. It won't necessarily be the cpu
> that submitted the request, but I think that's actually far less
> important in terms of performance.
>
> > From your reply below I understood that the assignment of a CPU to an
> > MSI-X vector within the NUMA node/(balance level core) is based on
> > the CPU workload (coming from cat /proc/interrupts) at the time of
> > assignment:
> >
> > "That is certainly not the case. irqbalance assigns irqs to cpus by
> > estimating how much cpu load each irq is responsible for, and
> > assigning that irq to the cpu that is the least loaded at the time of
> > assignment"
> >
> > Is it possible and useful to bypass that policy and provide a
> > <key/value> option? Once that option is used, <irqbalance> would keep
> > assigning the driver's irq# in sequential order, not based on
> > interrupt load.
> >
> It is possible, that's what the policyscript option is for. But as
> noted above, a blind sequential assignment isn't a universal benefit,
> and so not something I'm going to make the default for irqbalance. But
> sure, add a policyscript, and you can assign irqs in any way that you
> feel is best for your specific workload.

Thanks Neil. Understood that the default policy is good for deployments
where most of the improvement comes from localizing to the NUMA node of
the hardware, which is what irqbalance is now doing with the default
policy. I have everything I wanted to solve my puzzle.
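As a footnote on the <rq_affinity> tunable mentioned above, for anyone
following along: it is a per-request-queue sysfs attribute of each block
device. A quick sketch of checking and setting it (sdb here is just one
of the mpt3sas disks from the lstopo output; adjust for your devices):

  # 0 = complete wherever the IRQ lands, 1 = steer completions towards
  # the submitting CPU's group, 2 = force completion onto the exact CPU
  # that submitted the request.
  cat /sys/block/sdb/queue/rq_affinity
  echo 2 > /sys/block/sdb/queue/rq_affinity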
Your expertise and active input helped me a lot, and may be helpful for
others who are looking for a similar solution. Once again, thanks!

>
> Neil
>