> Ok, your script looks ok (though I'm not sure the kernel you are using
> supports the existence of the driver directory in /proc/irq/<n>, so it may
> not be functioning as expected).

Yes, the kernel exposes /proc/irq/<n>, so the required interface is there.

> > From the attached irqbalance debug output, I can see that irqbalance is
> > able to work as expected with the policy script.
>
> Agreed, it does seem to be functioning properly. Further, I see that those
> interrupts are properly assigned to a single core. Which leads me to
> wonder exactly what is going on here. The debug output seems to
> contradict the affinity mask information you provided earlier.
>
> > > > What is confusing me is - "cpu affinity mask" is just localized to
> > > > Numa Node-0, as PCI device enumeration detected the pci device is
> > > > local to numa_node 0.
> > >
> > > I really dont know what you mean by this. Yes, your masks seem to be
> > > following what could be your numa node layout, but you're assuming
> > > (or it sounds like you're assuming) that irqbalance is doing that
> > > intentionally. Its not, one of the above things is going on.
> > >
> > > > When you say "Driver does not participate in sysfs enumeration" -
> > > > does it mean "numa_node" exposure in sysfs, or anything more than
> > > > that? Sorry for the basics, and thanks for helping me understand.
> > >
> > > I mean, does your driver register itself as a pci device? If so, it
> > > should have a directory in sysfs in /sys/bus/pci/<pci b:d:f>/. As long
> > > as that directory exists and is properly populated, irqbalance should
> > > have everything it needs to properly assign a cpu to all of your irqs.
> >
> > Yes, the driver registers the device as a pci device, and I can see the
> > /sys/bus/pci/devices/ entries for all mpt3sas attached devices.
>
> Agreed, your debug information bears that out.
>
> > > Note that the RHEL6 kernel did not always properly populate that
> > > directory. I added sysfs code to expose the needed irq information in
> > > the kernel, and if you have an older kernel and newer irqbalance, that
> > > might be part of the problem - another reason to contact oracle.
> > >
> > > Another thing you can try is posting the output of irqbalance while
> > > running it with -f and -d. That will give us some insight as to what
> > > its doing (note I'm referring here to upstream irqbalance, not the old
> > > version). And you still didn't answer my question regarding the
> > > policyscript.
> >
> > I have attached the irqbalance debug output (latest from github, last
> > commit 8922ff13704dd0e069c63d46a7bdad89df5f151c) and the policy script.
> >
> > For some reason I have had to move to a different server, which has
> > 4 NUMA sockets.
> > Here is the detail of my setup -
> >
> > [root]# lstopo-no-graphics
> > Machine (64GB)
> >   NUMANode L#0 (P#0 16GB)
> >     Socket L#0 + L3 L#0 (10MB)
> >       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
> >         PU L#0 (P#0)
> >         PU L#1 (P#16)
> >       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
> >         PU L#2 (P#1)
> >         PU L#3 (P#17)
> >       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
> >         PU L#4 (P#2)
> >         PU L#5 (P#18)
> >       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
> >         PU L#6 (P#3)
> >         PU L#7 (P#19)
> >     HostBridge L#0
> >       PCIBridge
> >         PCI 8086:0953
> >       PCIBridge
> >         PCI 1000:0097
> >           Block L#0 "sdb"
> >           Block L#1 "sdc"
> >           Block L#2 "sdd"
> >           Block L#3 "sde"
> >           Block L#4 "sdf"
> >           Block L#5 "sdg"
> >           Block L#6 "sdh"
> >       PCIBridge
> >         PCI 102b:0532
> >       PCI 8086:1d02
> >         Block L#7 "sda"
> >   NUMANode L#1 (P#1 16GB)
> >     Socket L#1 + L3 L#1 (10MB)
> >       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
> >         PU L#8 (P#4)
> >         PU L#9 (P#20)
> >       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
> >         PU L#10 (P#5)
> >         PU L#11 (P#21)
> >       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
> >         PU L#12 (P#6)
> >         PU L#13 (P#22)
> >       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
> >         PU L#14 (P#7)
> >         PU L#15 (P#23)
> >     HostBridge L#4
> >       PCIBridge
> >         PCI 1000:005b
> >   NUMANode L#2 (P#2 16GB) + Socket L#2 + L3 L#2 (10MB)
> >     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
> >       PU L#16 (P#8)
> >       PU L#17 (P#24)
> >     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
> >       PU L#18 (P#9)
> >       PU L#19 (P#25)
> >     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
> >       PU L#20 (P#10)
> >       PU L#21 (P#26)
> >     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
> >       PU L#22 (P#11)
> >       PU L#23 (P#27)
> >   NUMANode L#3 (P#3 16GB)
> >     Socket L#3 + L3 L#3 (10MB)
> >       L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
> >         PU L#24 (P#12)
> >         PU L#25 (P#28)
> >       L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
> >         PU L#26 (P#13)
> >         PU L#27 (P#29)
> >       L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
> >         PU L#28 (P#14)
> >         PU L#29 (P#30)
> >       L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
> >         PU L#30 (P#15)
> >         PU L#31 (P#31)
> >     HostBridge L#6
> >       PCIBridge
> >         PCI 8086:1528
> >           Net L#8 "enp193s0f0"
> >         PCI 8086:1528
> >           Net L#9 "enp193s0f1"
> >       PCIBridge
> >         PCI 8086:0953
> >       PCIBridge
> >
> > Things are a bit clearer now, but what I am seeing here is: with
> > <ignore> and the attached policy hint, the CPU to MSI-X mask is not the
> > same as the logical sequence of CPU #. It is random within the core. I
> > guess it is based on some linked list in irqbalance, which I am not able
> > to understand.
>
> What exactly do you mean by logic sequence of cpu #? Are you under the
> impression that msix vectors should be mapped to their corresponding cpu
> number?

The L# numbers in the above output of lstopo-no-graphics are what I mean by
the logical, sequential CPU numbering. From the driver's point of view we
are looking for parallel IO completion across CPUs, so we always take the
logical CPU sequence into account. What I am trying to highlight, or to
understand, is whether we can somehow populate the same affinity mask for
each CPU as we usually see with the <exact> policy + driver hinting. That
would be one of the best configurations for a high-end storage
controller/device with multiple completion queues.
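To make the end state I am asking about concrete, below is a minimal
user-space sketch of my own (not irqbalance code) of what the <exact> hint
policy effectively produces for these vectors: each irq's smp_affinity is
set to whatever the driver published in affinity_hint. The irq numbers
355-358 are the ones from the masks I quote further down; it has to run as
root, and irqbalance will normally just overwrite these masks again on its
next pass unless the irqs are banned or steered via a policy script.

/*
 * exact_hint.c - copy each irq's affinity_hint into its smp_affinity,
 * which is the end state the <exact> hint policy gives us.
 * Illustration only; irq numbers are the mpt3sas vectors on my setup.
 */
#include <stdio.h>

static void apply_hint(int irq)
{
    char path[64], hint[256];
    FILE *f;

    /* read the mask the driver published via irq_set_affinity_hint() */
    snprintf(path, sizeof(path), "/proc/irq/%d/affinity_hint", irq);
    f = fopen(path, "r");
    if (!f) {
        perror(path);
        return;
    }
    if (!fgets(hint, sizeof(hint), f)) {
        fclose(f);
        return;
    }
    fclose(f);

    /* write the same mask back as the irq's effective affinity */
    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    f = fopen(path, "w");
    if (!f) {
        perror(path);
        return;
    }
    fputs(hint, f);
    fclose(f);

    printf("irq %d: smp_affinity <- %s", irq, hint);
}

int main(void)
{
    int irqs[] = { 355, 356, 357, 358 };  /* mpt3sas MSI-X vectors here */
    unsigned int i;

    for (i = 0; i < sizeof(irqs) / sizeof(irqs[0]); i++)
        apply_hint(irqs[i]);
    return 0;
}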
> That is certainly not the case. irqbalance assigns irqs to cpus by
> estimating how much cpu load each irq is responsible for, and assigning
> that irq to the cpu that is the least loaded at the time of assignment
> (the idea being that cpu load on each cpu is as close to equal as we can
> get it after balancing). If you have some additional need to place irqs
> on cpus such that some higher level task executes on the same cpu as the
> interrupt, it is the responsibility of the higher level task to match the
> irq placement, not vice versa. You can do some trickery to make them
> match up (i.e. do balancing via a policy script that assigns cores based
> on the static configuration of the higher level tasks), but such
> configurations are never going to be standard, and generally you need to
> have those higher level tasks follow the lower level decisions.

We are actually able to solve half of the problem with the additional
--policyscript option when <ignore> is the default hint policy. The half
that is solved is: the driver now always receives the completion back on
the same node that was part of the submission; there is no cross-NUMA-node
submission and completion. All IO generated from Node-x now completes on
Node-x. This is good for a storage HBA, since we are mainly focused on
parallel completion rather than NUMA locality.

The next level of issue can be resolved via kernel-level tuning, but I
wanted to check with an irqbalance expert whether we can manage it within
<irqbalance> or not. Here is that next level of the problem; your input
will help me understand the possibilities.

On my setup Socket 0 has the layout below:

  NUMANode L#0 (P#0 16GB)
    Socket L#0 + L3 L#0 (10MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#16)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#17)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#18)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#19)

For simplicity, assume only node-0 is active in the IO path, and take the
special case where L#0, L#2, L#4 and L#6 are kept busy doing IO submission.
With the latest <ignore> + --policyscript option, I see completion lands on
L#1, L#3, L#5 and L#7. This can cause IO latency: the submitter keeps
sending, which piles up work on the other cpus, unless the completion is
forcefully migrated back to the _exact_ cpu. This is currently tunable in
the kernel via the <rq_affinity> value 2. I am trying to understand whether
this can be done in <irqbalance> itself, to avoid the <rq_affinity> setting
in the kernel; that way we only need to tune one component.

From your reply below I understood that the assignment of a cpu to an msix
vector within a numa node / balance-level core is based on cpu work load
(coming from /proc/interrupts) at the time of assignment:

  "That is certainly not the case. irqbalance assigns irqs to cpus by
  estimating how much cpu load each irq is responsible for, and assigning
  that irq to the cpu that is the least loaded at the time of assignment"

Is it possible, and would it be useful, to bypass that policy and provide a
<key/value> option? Once that option is used, <irqbalance> would keep
assigning the driver's irq numbers in sequential order, not based on
interrupt load.

Thanks for trying to digest my queries and providing all your technical
inputs.
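One more illustration of why the exact cpu inside the node matters here,
and not just the node: the driver picks its reply queue (msix index) from
the submitting cpu via the cpu_msix_table it builds (see the driver code
quoted further down), so the completion only lands back on the submitter
when vector n stays affined to the n-th logical cpu of its group. The toy
program below is just my own model of that mapping, not the actual mpt3sas
code; NR_CPUS and NR_MSIX are example values roughly matching node 0 of my
setup.

/*
 * reply_queue_map.c - toy model of how a submitting cpu chooses its
 * completion (msix) vector, following the round-robin grouping idea in
 * _base_assign_reply_queues() quoted below.
 */
#include <stdio.h>

#define NR_CPUS  8   /* logical cpus used for submission (example) */
#define NR_MSIX  4   /* completion queues / msix vectors (example) */

int main(void)
{
    int cpu_msix_table[NR_CPUS];
    int cpu = 0, index, i;

    /* split the online cpus evenly across the msix vectors, in
     * logical cpu order, exactly like the driver's grouping loop */
    for (index = 0; index < NR_MSIX; index++) {
        int group = NR_CPUS / NR_MSIX;

        if (index < NR_CPUS % NR_MSIX)
            group++;
        for (i = 0; i < group && cpu < NR_CPUS; i++, cpu++)
            cpu_msix_table[cpu] = index;
    }

    /*
     * At submission time the driver conceptually does
     *     msix = cpu_msix_table[smp_processor_id()];
     * so the completion interrupt comes back to the submitting cpu only
     * if irqbalance (or a policy script) keeps that vector affined to a
     * cpu in the same group - ideally the submitting cpu itself.
     */
    for (cpu = 0; cpu < NR_CPUS; cpu++)
        printf("submit on CPU %d -> completion via msix index %d\n",
               cpu, cpu_msix_table[cpu]);
    return 0;
}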
> > E.g. the piece of code below hints that, per core, irq numbers are
> > stored in the linked list "d->interrupts". This list is not based on a
> > sequential traverse, but rather on how the interrupts are generated.
> > Right?
>
> As noted above, its a measurement of approximate cpu load generated by
> each interrupt. As far as assignment goes, there is no relevance to which
> cpu handles an interrupt within a numa node from the standpoint of the
> interrupt itself. If you have some requirement to align interrupts with
> other execution contexts, then you either need to write your policy
> script to understand those higher level configurations, or you need to
> modify your higher level setup to follow what irqbalance does.
>
> > static void dump_cache_domain(struct topo_obj *d, void *data)
> > {
> >     char *buffer = data;
> >     cpumask_scnprintf(buffer, 4095, d->mask);
> >     log(TO_CONSOLE, LOG_INFO,
> >         "%s%sCache domain %i: numa_node is %d cpu mask is %s (load %lu) \n",
> >         log_indent, log_indent,
> >         d->number, cache_domain_numa_node(d)->number, buffer,
> >         (unsigned long)d->load);
> >     if (d->children)
> >         for_each_object(d->children, dump_balance_obj, NULL);
> >     if (g_list_length(d->interrupts) > 0)
> >         for_each_irq(d->interrupts, dump_irq, (void *)10);
> > }
> >
> > I sometimes see different cpu masks, as in the snippets below. The cpu
> > mask on my setup varies from run to run, but the good thing is that the
> > mask stays within the <core>; it is just not like <exact>.
>
> They're not going to be the same, I'm not sure why that is so hard to
> understand. All affinity_hint is is a driver's best guess as to where to
> put an irq, and its, as a rule, sub-optimal. Expecting irqbalance to
> arrive at the same decision as affinity_hint independently is improper
> reasoning.
>
> > msix index = 0, irq number = 355, cpu affinity mask = 00000008  hint = 00000001
> > msix index = 1, irq number = 356, cpu affinity mask = 00000004  hint = 00000002
> > msix index = 2, irq number = 357, cpu affinity mask = 00000002  hint = 00000004
> > msix index = 3, irq number = 358, cpu affinity mask = 00000001  hint = 00000008
> >
> > msix index = 0, irq number = 355, cpu affinity mask = 00000002  hint = 00000001
> > msix index = 1, irq number = 356, cpu affinity mask = 00000008  hint = 00000002
> > msix index = 2, irq number = 357, cpu affinity mask = 00000004  hint = 00000004
> > msix index = 3, irq number = 358, cpu affinity mask = 00000001  hint = 00000008
> >
> > I am expecting the layout below, because once the <mpt3sas> driver sends
> > an IO it hints the FW about the completion queue. E.g. if an IO is
> > submitted from logical CPU #X, the driver uses smp_processor_id() to get
> > that logical CPU #X and expects the completion on the same CPU for
> > better performance. Is this expectation possible with the existing
> > latest <irqbalance>?
>
> It is, but only through the use of a policy script that you write to
> guarantee that assignment. With the information that irqbalance has at
> hand, there is no need to balance irqs in the below manner. Even if there
> is some upper layer indicator that suggests it might be beneficial,
> theres nothing that irqbalance can use to consistently determine that.
> You can certainly do it with a policy script if you like in any number of
> ways, but any logic you put in that script is by and large going to
> remain there (i.e. its not going to become part of the default
> behavior), unless you can provide:
>
> a) a mechanism to consistently query devices to determine if they should
>    be balanced in this manner
>
> b) evidence that for all devices of that class, this provides a
>    performance benefit.
>
> In other words, you have to show me which irqs this is a benefit for, and
> how to tell if an arbitrary irq fits that profile.
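Understood. Just to check that I am reading the policy-script interface
correctly, a minimal policy helper along the lines below is the kind of
thing I would write. This is only a sketch: I am assuming the conventions
from the irqbalance(1) man page (the script is invoked per irq with the
sysfs device path and the irq number as arguments, and returns key=value
pairs on stdout), and the PCI address and node number are placeholders for
my HBA. A shell script would do the same job; a compiled program works too,
since irqbalance just executes whatever --policyscript points at.

/*
 * mpt3sas_policy.c - sketch of a --policyscript helper (assumptions as
 * noted above).  For irqs belonging to my HBA it asks irqbalance to keep
 * balancing at core level inside the HBA's local numa node; every other
 * irq is left to the defaults by printing nothing.
 */
#include <stdio.h>
#include <string.h>

#define HBA_PCI_ADDR  "0000:02:00.0"  /* placeholder PCI b:d:f of the HBA */
#define HBA_NUMA_NODE 0               /* placeholder local node of the HBA */

int main(int argc, char **argv)
{
    /* argv[1] = sysfs device path, argv[2] = irq number (assumed) */
    const char *devpath = (argc > 1) ? argv[1] : "";

    if (strstr(devpath, HBA_PCI_ADDR)) {
        printf("balance_level=core\n");
        printf("numa_node=%d\n", HBA_NUMA_NODE);
    }
    return 0;
}

If newer irqbalance builds also accept a per-irq hintpolicy key from the
policy script, the same helper could request exact hint following for just
these vectors; I have not verified that against the version I am running,
though.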
> Note also that, based on what you're saying above, I don't think this is
> going to provide a consistent benefit. It sounds like what you're
> suggesting is that the affinity hint of the mpt3sas driver assigns hints
> based on what cpu it expects io/requests to come in on (thereby matching
> completion interrupts with the submitted io data). But looking at the
> code:
>
> a) there doesn't appear to be any assignment like that in mpt3sas, its
>    just blindly assigning hints based on the online cpus at the time the
>    device was registered. From the mpt3sas driver in
>    _base_assign_reply_queues():
>
>     list_for_each_entry(reply_q, &ioc->reply_queue_list, list) {
>         unsigned int i, group = nr_cpus / nr_msix;
>
>         if (cpu >= nr_cpus)
>             break;
>
>         if (index < nr_cpus % nr_msix)
>             group++;
>
>         for (i = 0 ; i < group ; i++) {
>             ioc->cpu_msix_table[cpu] = index;
>             cpumask_or(reply_q->affinity_hint,
>                 reply_q->affinity_hint, get_cpu_mask(cpu));
>             cpu = cpumask_next(cpu, cpu_online_mask);
>         }
>
>         if (irq_set_affinity_hint(reply_q->vector,
>                 reply_q->affinity_hint))
>             dinitprintk(ioc, pr_info(MPT3SAS_FMT
>                 "error setting affinity hint for irq vector %d\n",
>                 ioc->name, reply_q->vector));
>         index++;
>     }
>
> and
>
> b) There is no guarantee that, for any given i/o, it will be submitted on
>    a cpu that is in a mask assigned by the affinity hint.

Half of the logic to meet this requirement is in the HBA firmware: it knows
which msix index is to be used for the completion. So the simple equation
is to map all logical CPUs onto the number of completion queues in the
hardware, and the cpu-to-msix affinity needs the same interpretation at the
same time. That was working as expected with the <exact> policy, and we are
trying to achieve the same using <ignore> plus --policyscript.

> Neil
>
> > msix index = 0, irq number = 355, cpu affinity mask = 00000001  hint = 00000001
> > msix index = 1, irq number = 356, cpu affinity mask = 00000002  hint = 00000002
> > msix index = 2, irq number = 357, cpu affinity mask = 00000004  hint = 00000004
> > msix index = 3, irq number = 358, cpu affinity mask = 00000008  hint = 00000008
> >
> > ~ Kashyap
> >
> > > Neil