> -----Original Message-----
> From: Neil Horman [mailto:nhorman at tuxdriver.com]
> Sent: Wednesday, October 14, 2015 5:03 PM
> To: Kashyap Desai
> Cc: Neil Horman; irqbalance at lists.infradead.org; Peter Rivera; Mike Roy;
> Sreekanth Reddy
> Subject: Re: irqbalancer subset policy and CPU lock up on storage
> controller.
>
> On Wed, Oct 14, 2015 at 01:33:48PM +0530, Kashyap Desai wrote:
> > >
> > > Ok, your script looks ok (though I'm not sure the kernel you are
> > > using supports the existence of the driver directory in
> > > /proc/irq/<n>, so it may not be functioning as expected).
> >
> > Yes, the kernel exposes /proc/irq/<n>, so the kernel has the required
> > interface.
> >
> No, not /proc/irq/<n>, /proc/irq/<n>/<driver name>. The former has
> always been there, the latter is fairly recent.

Yes, I see /proc/irq/<n>/<irq name>. Most drivers use the driver name
string in the irq name at the time of request_irq(). In my case it is
"/proc/irq/355/mpt3sas0-msix0". I think this is what you mean.

> > > > >
> > > > > From the attached irqbalance debug output, I can see that
> > > > > irqbalance is able to work as expected with the policy script.
> > > > >
> > > Agreed, it does seem to be functioning properly. Further, I see that
> > > those interrupts are properly assigned to a single core. Which leads
> > > me to wonder exactly what is going on here. The debug output seems
> > > to contradict the affinity mask information you provided earlier.
> > >
> > > > > > What is confusing me is that the "cpu affinity mask" is just
> > > > > > localized to NUMA Node-0, as PCI device enumeration detected
> > > > > > that the PCI device is local to numa_node 0.
> > > > > I really don't know what you mean by this. Yes, your masks seem
> > > > > to be following what could be your NUMA node layout, but you're
> > > > > assuming (or it sounds like you're assuming) that irqbalance is
> > > > > doing that intentionally. It's not; one of the above things is
> > > > > going on.
> > > > >
> > > > > > When you say "Driver does not participate in sysfs
> > > > > > enumeration" - does it mean "numa_node" exposure in sysfs or
> > > > > > anything more than that? Sorry for the basics, and thanks for
> > > > > > helping me understand things.
> > > > > I mean, does your driver register itself as a pci device? If
> > > > > so, it should have a directory in sysfs in
> > > > > /sys/bus/pci/<pci b:d:f>/. As long as that directory exists and
> > > > > is properly populated, irqbalance should have everything it
> > > > > needs to properly assign a cpu to all of your irqs.
> > > >
> > > > Yes, the driver registers the device as a PCI device and I can see
> > > > all the /sys/bus/pci/devices/ entries for the mpt3sas attached
> > > > device.
> > > >
> > > Agreed, your debug information bears that out.
> > >
> > > > > Note that the RHEL6 kernel did not always properly populate that
> > > > > directory. I added sysfs code to expose the needed irq
> > > > > information in the kernel, and if you have an older kernel and a
> > > > > newer irqbalance, that might be part of the problem - another
> > > > > reason to contact Oracle.
> > > > >
> > > > > Another thing you can try is posting the output of irqbalance
> > > > > while running it with -f and -d. That will give us some insight
> > > > > as to what it's doing (note I'm referring here to upstream
> > > > > irqbalance, not the old version).
> > > > > And you still didn't answer my question regarding the
> > > > > policyscript.
> > > >
> > > > I have attached the irqbalance debug output (latest from github,
> > > > last commit 8922ff13704dd0e069c63d46a7bdad89df5f151c) and the
> > > > policy script.
> > > >
> > > > For some reason I had to move to a different server, which has 4
> > > > NUMA sockets.
> > > >
> > > > Here are the details of my setup -
> > > >
> > > > [root]# lstopo-no-graphics
> > > > Machine (64GB)
> > > >   NUMANode L#0 (P#0 16GB)
> > > >     Socket L#0 + L3 L#0 (10MB)
> > > >       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
> > > >         PU L#0 (P#0)
> > > >         PU L#1 (P#16)
> > > >       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
> > > >         PU L#2 (P#1)
> > > >         PU L#3 (P#17)
> > > >       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
> > > >         PU L#4 (P#2)
> > > >         PU L#5 (P#18)
> > > >       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
> > > >         PU L#6 (P#3)
> > > >         PU L#7 (P#19)
> > > >     HostBridge L#0
> > > >       PCIBridge
> > > >         PCI 8086:0953
> > > >       PCIBridge
> > > >         PCI 1000:0097
> > > >           Block L#0 "sdb"
> > > >           Block L#1 "sdc"
> > > >           Block L#2 "sdd"
> > > >           Block L#3 "sde"
> > > >           Block L#4 "sdf"
> > > >           Block L#5 "sdg"
> > > >           Block L#6 "sdh"
> > > >       PCIBridge
> > > >         PCI 102b:0532
> > > >       PCI 8086:1d02
> > > >         Block L#7 "sda"
> > > >   NUMANode L#1 (P#1 16GB)
> > > >     Socket L#1 + L3 L#1 (10MB)
> > > >       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
> > > >         PU L#8 (P#4)
> > > >         PU L#9 (P#20)
> > > >       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
> > > >         PU L#10 (P#5)
> > > >         PU L#11 (P#21)
> > > >       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
> > > >         PU L#12 (P#6)
> > > >         PU L#13 (P#22)
> > > >       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
> > > >         PU L#14 (P#7)
> > > >         PU L#15 (P#23)
> > > >     HostBridge L#4
> > > >       PCIBridge
> > > >         PCI 1000:005b
> > > >   NUMANode L#2 (P#2 16GB) + Socket L#2 + L3 L#2 (10MB)
> > > >     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
> > > >       PU L#16 (P#8)
> > > >       PU L#17 (P#24)
> > > >     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
> > > >       PU L#18 (P#9)
> > > >       PU L#19 (P#25)
> > > >     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
> > > >       PU L#20 (P#10)
> > > >       PU L#21 (P#26)
> > > >     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
> > > >       PU L#22 (P#11)
> > > >       PU L#23 (P#27)
> > > >   NUMANode L#3 (P#3 16GB)
> > > >     Socket L#3 + L3 L#3 (10MB)
> > > >       L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
> > > >         PU L#24 (P#12)
> > > >         PU L#25 (P#28)
> > > >       L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
> > > >         PU L#26 (P#13)
> > > >         PU L#27 (P#29)
> > > >       L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
> > > >         PU L#28 (P#14)
> > > >         PU L#29 (P#30)
> > > >       L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
> > > >         PU L#30 (P#15)
> > > >         PU L#31 (P#31)
> > > >     HostBridge L#6
> > > >       PCIBridge
> > > >         PCI 8086:1528
> > > >           Net L#8 "enp193s0f0"
> > > >         PCI 8086:1528
> > > >           Net L#9 "enp193s0f1"
> > > >       PCIBridge
> > > >         PCI 8086:0953
> > > >       PCIBridge
> > > >
> > > > Things are a bit clearer now, but what I am seeing here is: with
> > > > <ignore> and the attached policy hint, the CPU to MSI-X mask is
> > > > not the same as the logical sequence of CPU #. It is random within
> > > > the core. I guess it is based on some linked list in irqbalance,
> > > > which I am not able to understand.
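As a side note for anyone reproducing this: the MSI-X-to-CPU mapping I am
describing can be dumped with a quick loop over the standard /proc
interfaces. This is only a sketch; it assumes the vectors show up in
/proc/interrupts under the "mpt3sas" name the driver registers via
request_irq():

  # Print each mpt3sas vector together with the CPUs it is currently
  # allowed to fire on (smp_affinity_list is the human-readable form of
  # the affinity mask).
  for irq in $(awk -F: '/mpt3sas/ {print $1}' /proc/interrupts); do
      echo "irq ${irq}: cpus $(cat /proc/irq/${irq}/smp_affinity_list)"
  done

Comparing that output against the lstopo output above is how I am judging
whether the assignment follows the logical CPU sequence.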
> > > What exactly do you mean by logic sequence of cpu #? Are you under
> > > the impression that msix vectors should be mapped to their
> > > corresponding cpu number?
> >
> > In the above output of lstopo-no-graphics, L# is what I mean by the
> > logical CPU sequence. From the driver's point of view we are looking
> > for parallel IO completion on CPUs, so we always take the logical CPU
> > sequence into account.
> > What I am trying to highlight or understand is: can we somehow
> > populate the same affinity mask for each CPU as we usually see with
> > the <exact> policy + driver hinting? That would be one of the best
> > configurations for a high-end storage controller/device with multiple
> > completion queues.
> >
> As I said, you can, but irqbalance will not do it on its own, because,
> while you assert that it's best for performance, that's not actually
> the case (or rather a misguided interpretation of the case). The reason
> it's beneficial is because higher layer software (in this case the
> block io completion handler) is interrogating block io requests to
> ensure that completions occur on the same cpu that the i/o was
> submitted on. Irqbalance has no knowledge of that, nor will it. You can
> certainly write a policyscript that will implement this (in fact it
> sounds like, given the hinting that the driver is applying, you can
> just set hint_policy=exact for those specific interrupts), and be done
> with it.

I did not notice this option. It's my bad that I did not install
irqbalance and instead worked out of a local directory, so I was looking
at an old irqbalance man page. What you explained above is exactly what I
was looking for. Fine-grained tuning per IRQ with hintpolicy=exact is a
good default for the mpt3sas driver. I modified the script as below and
got the expected results (a fuller sketch of the whole policy script
follows further down):

  echo "numa_node=${node}"
  echo "hintpolicy=exact"

> But it's important to note that, even in that case, it won't always
> work. That's because the driver is hinting that irq vector 1 should
> affine to cpu1, 2 to 2, and so on, the underlying assumption being that
> irq vector 2 is what triggers for i/o completion when a request
> submitted from cpu2 is done. That need not be the case (many network
> cards with storage functions integrated don't adhere to that at all).
> You are welcome to write a policy script to do what you are trying to
> do here (or simply assign exact hint policies globally if you like),
> but you can understand here I hope why that's not going to become the
> default. What you observe as your desired behavior is far from a
> universal constant.
>
> > > That is certainly not the case. irqbalance assigns irqs to cpus by
> > > estimating how much cpu load each irq is responsible for, and
> > > assigning that irq to the cpu that is the least loaded at the time
> > > of assignment (the idea being that cpu load on each cpu is as close
> > > to equal as we can get it after balancing). If you have some
> > > additional need to place irqs on cpus such that some higher level
> > > task executes on the same cpu as the interrupt, it is the
> > > responsibility of the higher level task to match the irq placement,
> > > not vice versa. You can do some trickery to make them match up
> > > (i.e. do balancing via a policy script that assigns cores based on
> > > the static configuration of the higher level tasks), but such
> > > configurations are never going to be standard, and generally you
> > > need to have those higher level tasks follow the lower level
> > > decisions.
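For completeness, here is roughly what the modified policy script looks
like on my setup. This is only a sketch under a couple of assumptions:
that irqbalance invokes the policy script with the sysfs device path and
the IRQ number as its two arguments, and that the mpt3sas vectors can be
recognised by the mpt3sas* entries under /proc/irq/<n>/ (as in the
/proc/irq/355/mpt3sas0-msix0 example above).

  #!/bin/bash
  # Sketch of a per-IRQ policy script for irqbalance --policyscript.
  # irqbalance runs it once per IRQ and parses key=value pairs on stdout.
  device_path=$1    # sysfs path of the device owning this IRQ
  irq=$2            # IRQ number

  # Only steer the mpt3sas MSI-X vectors; everything else falls back to
  # the normal balancing behaviour.
  if ls "/proc/irq/${irq}/" 2>/dev/null | grep -q '^mpt3sas'; then
      # Keep the vector on the adapter's local NUMA node ...
      node=$(cat "${device_path}/numa_node" 2>/dev/null)
      if [ -n "${node}" ] && [ "${node}" -ge 0 ]; then
          echo "numa_node=${node}"
      fi
      # ... and honour the per-vector affinity hint the driver registers.
      echo "hintpolicy=exact"
  fi

For testing I run it in the foreground with something like
"irqbalance -f -d --policyscript=/etc/sysconfig/mpt3sas-policy.sh" (the
script path here is just an example; use wherever you saved it).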
> > We are actually able to solve half of the problem with the additional
> > --policyscript option if <ignore> is the default policy. The half
> > that is solved is: "the driver now always receives the completion
> > back on the same node that was part of the submission; there is no
> > cross-NUMA-node submission and completion."
> > Now all IO generated from Node-x always completes on Node-x. This is
> > good for a storage HBA, as we are mainly focusing on parallel
> > completion and not on NUMA locality.
> >
> > The next level of issue can be resolved via kernel-level tuning, but
> > I thought I would check with an irqbalance expert whether we can
> > manage it with <irqbalance> or not.
> > Here is the next level of the problem. Your input will help me
> > understand the possibilities.
> >
> > On my setup, Socket 0 has the layout below.
> >
> > NUMANode L#0 (P#0 16GB)
> >   Socket L#0 + L3 L#0 (10MB)
> >     L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
> >       PU L#0 (P#0)
> >       PU L#1 (P#16)
> >     L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
> >       PU L#2 (P#1)
> >       PU L#3 (P#17)
> >     L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
> >       PU L#4 (P#2)
> >       PU L#5 (P#18)
> >     L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
> >       PU L#6 (P#3)
> >       PU L#7 (P#19)
> >
> > For simplicity, assume only node-0 is active in the IO path, and
> > consider the special case where L#0, L#2, L#4 and L#6 are kept busy
> > doing IO submission. With the latest <ignore> + --policyscript
> > option, I see completion is on L#1, L#3, L#5 and L#7. This can cause
> > IO latency, as the submitter keeps sending, which piles up work on
> > the other CPUs unless the completion is forcefully migrated to the
> > _exact_ CPU. This is currently tunable in the kernel via the
> > <rq_affinity> value 2.
> >
> > I am trying to understand if this can be done in <irqbalance> itself,
> > to avoid the <rq_affinity> setting in the kernel. That way we would
> > only need to tune one component.
> >
> Sort of. If you don't touch rq_affinity, and simply run irqbalance such
> that each vector gets a unique cpu, then the completion will execute on
> the cpu that triggered the interrupt. It won't necessarily be the cpu
> that submitted the request, but I think that's actually far less
> important in terms of performance.
>
> > From your reply below I understood that the assignment of a CPU to an
> > MSI-X vector within the NUMA node/(balance level core) is based on
> > the CPU workload (coming from cat /proc/interrupts) at the time of
> > assignment:
> >
> > "That is certainly not the case. irqbalance assigns irqs to cpus by
> > estimating how much cpu load each irq is responsible for, and
> > assigning that irq to the cpu that is the least loaded at the time of
> > assignment"
> >
> > Is it possible and useful to bypass that policy and provide a
> > <key/value> option? Once that option is used, <irqbalance> would keep
> > assigning the driver's irq# in sequential order, not based on
> > interrupt load.
> >
> It is possible, that's what the policyscript option is for. But as
> noted above, a blind sequential assignment isn't a universal benefit,
> and so not something I'm going to make the default for irqbalance. But
> sure, add a policyscript, and you can assign irqs in any way that you
> feel is best for your specific workload.

Thanks Neil. Understood that the default policy is good for deployments
where most of the improvement comes from localizing to the NUMA node of
the hardware, which is what irqbalance is now doing with the default
policy. I have everything I wanted to solve my puzzle.
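As a footnote on the <rq_affinity> tunable mentioned above, for anyone
following along: it is a per-request-queue sysfs attribute of each block
device. A quick sketch of checking and setting it (sdb here is just one
of the mpt3sas disks from the lstopo output; adjust for your devices):

  # 0 = complete wherever the IRQ lands, 1 = steer completions towards
  # the submitting CPU's group, 2 = force completion onto the exact CPU
  # that submitted the request.
  cat /sys/block/sdb/queue/rq_affinity
  echo 2 > /sys/block/sdb/queue/rq_affinity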
Your expertise and active input helped me a lot, and may be helpful for
others who are looking for a similar solution. Once again, thanks!

>
> Neil
>