irqbalance subset policy and CPU lockup on storage controller.



+0530, Kashyap Desai wrote:
> > > On Mon, Oct 12, 2015 at 11:52:30PM +0530, Kashyap Desai wrote:
> > > > > > What should be the solution if we really want to slow down IO
> > > > > > submission to avoid CPU lockup. We don't want only one CPU to
> > > > > > keep busy for completion.
> > > > > >
> > > > > > Any suggestion ?
> > > > > >
> > > > > Yup, file a bug with Oracle :)
> > > >
> > > > Neil -
> > > >
> > > > Thanks for the info. I understood I should use the latest
> > > > <irqbalance>; that was already attempted. I tried the latest
> > > > irqbalance and I see the expected behavior as long as I provide
> > > > <exact>, or <subset> + <--policyscript>.
> > > > We are planning for the same, but wanted to understand the latest
> > > > <irqbalance> default settings. Is there any reason the default
> > > > changed from subset to ignore ?
> > > >
> > >
> > > Latest defaults are that hinting is ignored by default, but hinting
> > > can also be set via a policyscript on an irq by irq basis.
> > >
> > > The reasons for changing the default behavior are documented in
> > > commit d9138c78c3e8cb286864509fc444ebb4484c3d70.  Irq affinity
> > > hinting is effectively a holdover from back in the days when
> > > irqbalance couldn't understand a device's locality and irq count
> > > easily.  Now that it can, there is really no need for an irq
> > > affinity hint, unless your driver doesn't properly participate in
> > > sysfs device enumeration.
> >
> > Neil - I went through those details, but could not understand how
> > the <ignore> policy is useful. I may be missing something here. :-(
> Yes, what you are missing is the fact that affinity hinting is an older
> method of assigning affinity.  On any modern kernel it's not needed at
> all, so the default policy is to ignore it.

Now it is clear. I understand that no affinity hint from the driver is
required any more, and <irqbalance> can manage placement using the details
populated in sysfs.
> > With <ignore> policy, mpt3sas driver on 32 logical CPU system has
> > below affinity mask. As you said, driver hint is ignored.  That is
> > understood as <ignore> is hinting for the same, but why affinity mask
> > is just localized to local node (Node 0 in this case) ?
> This has nothing to do with the ignore hint policy.  The reasons the
> below might occur are:
> 1) the class of the device on the pci bus is such that irqbalance decides
> that numa node is the level at which it should be balanced.  Currently
> there are no such devices that get balanced at that level.  There are
> however package level balanced devices, and if you have a single cpu
> package (with multiple cores) on a single numa node, you might see this
> behavior. What is the class of the mpt3sas adapter?

<mpt3sas> is a <storage> class adapter. See the <class> sysfs details -

[root]# cat /sys/devices/pci0000:00/0000:00:03.0/0000:02:00.0/class

> 2) The interrupt controller on your system doesn't allow for user
> setting of interrupt affinity.  I don't think that would be the case
> given that interrupts can be affined.  If you can manually set the
> affinity of these irqs you can discount this possibility.

The affinity hint from the driver applied via the <exact> policy works on
my setup, and so does manually setting affinity. We can skip this part.
From the storage controller requirement side, we are looking for an
msix-vector to logical CPU# mapping in the same sequence.
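To illustrate the requirement, here is a hedged sketch (not from the thread) of how that sequential msix-vector to logical-CPU mapping could be applied by hand. The starting irq number 355 and the count of 4 vectors are taken from the example output later in this mail; both are assumptions to adjust for your adapter.

```shell
#!/bin/sh
# Sketch: pin MSI-X vector i to logical CPU i, i.e. vector 0 -> CPU 0,
# vector 1 -> CPU 1, and so on.  FIRST_IRQ and NUM_VECTORS are taken
# from the example output in this mail; adjust both for your adapter.

cpu_to_mask() {
    # Logical CPU number -> 8-digit hex mask in the format used by
    # /proc/irq/<n>/smp_affinity (e.g. CPU 3 -> 00000008).
    printf '%08x' $(( 1 << $1 ))
}

FIRST_IRQ=355
NUM_VECTORS=4

i=0
while [ "$i" -lt "$NUM_VECTORS" ]; do
    irq=$(( FIRST_IRQ + i ))
    mask=$(cpu_to_mask "$i")
    # Print the commands rather than running them; review the output,
    # then pipe it to sh as root (with irqbalance stopped first, or it
    # will rewrite the masks on its next pass).
    printf 'echo %s > /proc/irq/%d/smp_affinity\n' "$mask" "$irq"
    i=$(( i + 1 ))
done
```

Run as-is it only prints the `echo ... > /proc/irq/N/smp_affinity` commands, so it is safe to inspect before applying.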

> 3) You are using a policyscript that assigns these affinities.  As I
> requested, are you using a policy script and can you post it here?

I have attached the policy script (a very basic script; we created it just
to understand irqbalance, and it got our work done).
What I required was: "balance at core level and distribute across each
NUMA node."
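For readers following along, here is a minimal sketch of what an irqbalance --policyscript can look like. This is not the attached script; the invocation convention and key names follow the irqbalance(1) man page, so treat the details as assumptions to verify against your irqbalance version.

```shell
#!/bin/sh
# Minimal irqbalance policy-script sketch (an assumption-based example,
# not the script attached to this mail).  irqbalance runs the script
# once per irq with:
#   $1 = sysfs path of the device the irq belongs to
#   $2 = the irq number
# and parses key=value pairs from its stdout.

emit_policy() {
    # dev_path=$1, irq=$2 -- unused here: apply one policy to all irqs.
    echo "balance_level=core"   # balance at core granularity
    echo "hintpolicy=ignore"    # ignore the driver's affinity hint
}

emit_policy "$1" "$2"
```

A real script would typically switch on `$1`/`$2` to give storage and network irqs different treatment.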

From the attached irqbalance debug output, I can see that irqbalance is
able to work as expected with the policy script.

> > What is confusing me is - the "cpu affinity mask" is localized to NUMA
> > Node-0, as PCI device enumeration detected the pci device is local to
> > numa_node 0.
> I really don't know what you mean by this.  Yes, your masks seem to be
> following what could be your numa node layout, but you're assuming (or it
> sounds like you're assuming) that irqbalance is doing that intentionally.
> It's not; one of the above things is going on.
> >
> >
> > When you say "Driver does not participate in sysfs enumeration" - does
> > it mean "numa_node" exposure in sysfs, or anything more than that ?
> > Sorry for the basic questions, and thanks for helping me understand.
> >
> I mean, does your driver register itself as a pci device?  If so, it
> should have a directory in sysfs in /sys/bus/pci/<pci b:d:f>/.  As long
> as that exists and is properly populated, irqbalance should have
> everything it needs to properly assign a cpu to all of your irqs.

Yes, the driver registers the device as a pci device, and I can see all
the /sys/bus/pci/devices/ entries for the mpt3sas-attached device.
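A quick way to confirm that the sysfs fields irqbalance consumes are populated is a sketch like the one below. The B:D:F 0000:02:00.0 is taken from the sysfs path quoted earlier in this mail; substitute your own device address.

```shell
#!/bin/sh
# Sketch: report the sysfs attributes irqbalance reads for a pci device.
# Prints "<field>=missing" for anything the kernel did not populate.

check_dev() {
    dev=$1
    for f in class numa_node local_cpus; do
        if [ -f "$dev/$f" ]; then
            printf '%s=%s\n' "$f" "$(cat "$dev/$f")"
        else
            printf '%s=missing\n' "$f"
        fi
    done
}

# 0000:02:00.0 comes from the sysfs path quoted earlier; adjust as needed.
check_dev /sys/bus/pci/devices/0000:02:00.0
```

If `numa_node` reads -1 or `local_cpus` is empty, irqbalance has to fall back to guessing locality.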

> Note that the RHEL6 kernel did not always properly populate that
> directory.  I added sysfs code to export the needed irq information in
> the kernel, and if you have an older kernel with a newer irqbalance, that
> might be part of the problem - another reason to contact oracle.
> Another thing you can try is posting the output of irqbalance while
> running it with -f and -d.  That will give us some insight as to what
> it's doing (I'm referring here to upstream irqbalance, not the old
> version).  And you still didn't answer my question regarding the
> policyscript.

I have attached irqbalance (latest from github last commit
8922ff13704dd0e069c63d46a7bdad89df5f151c) debug output and policy script.

For some reason I had to move to a different server, which has 4 NUMA
nodes.

Here is the detail of my setup -

[root]# lstopo-no-graphics
Machine (64GB)
  NUMANode L#0 (P#0 16GB)
    Socket L#0 + L3 L#0 (10MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#16)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#17)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#18)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#19)
    HostBridge L#0
        PCI 8086:0953
        PCI 1000:0097
          Block L#0 "sdb"
          Block L#1 "sdc"
          Block L#2 "sdd"
          Block L#3 "sde"
          Block L#4 "sdf"
          Block L#5 "sdg"
          Block L#6 "sdh"
        PCI 102b:0532
      PCI 8086:1d02
        Block L#7 "sda"
  NUMANode L#1 (P#1 16GB)
    Socket L#1 + L3 L#1 (10MB)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#20)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#21)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#22)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#23)
    HostBridge L#4
        PCI 1000:005b
  NUMANode L#2 (P#2 16GB) + Socket L#2 + L3 L#2 (10MB)
    L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
      PU L#16 (P#8)
      PU L#17 (P#24)
    L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
      PU L#18 (P#9)
      PU L#19 (P#25)
    L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
      PU L#20 (P#10)
      PU L#21 (P#26)
    L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
      PU L#22 (P#11)
      PU L#23 (P#27)
  NUMANode L#3 (P#3 16GB)
    Socket L#3 + L3 L#3 (10MB)
      L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#28)
      L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#29)
      L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#30)
      L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#31)
    HostBridge L#6
        PCI 8086:1528
          Net L#8 "enp193s0f0"
        PCI 8086:1528
          Net L#9 "enp193s0f1"
        PCI 8086:0953

Things are a bit clearer now, but what I am seeing here is: with <ignore>
and the attached policy script, the CPU-to-MSI-X mask is not in the same
sequence as the logical CPU numbers. It is random within a core. I guess
it is based on some linked list in irqbalance which I am not able to
understand.
E.g. the piece of code below hints that, per core, irq numbers are stored
in the linked list "d->interrupts". This list is not traversed in
sequential order, but rather based on how the interrupts were generated.
Right?

static void dump_cache_domain(struct topo_obj *d, void *data)
{
        char *buffer = data;

        cpumask_scnprintf(buffer, 4095, d->mask);
        log(TO_CONSOLE, LOG_INFO,
            "%s%sCache domain %i:  numa_node is %d cpu mask is %s  (load %lu)\n",
            log_indent, log_indent,
            d->number, cache_domain_numa_node(d)->number, buffer,
            (unsigned long)d->load);
        if (d->children)
                for_each_object(d->children, dump_balance_obj, NULL);
        if (g_list_length(d->interrupts) > 0)
                for_each_irq(d->interrupts, dump_irq, (void *)10);
}

I sometimes see different cpu masks, as in the snippets below. The cpu
mask on my setup varies from run to run; the good thing is that the mask
stays within <core>, but it is not like <exact>.

	msix index = 0, irq number = 355, cpu affinity mask = 00000008, hint = 00000001
	msix index = 1, irq number = 356, cpu affinity mask = 00000004, hint = 00000002
	msix index = 2, irq number = 357, cpu affinity mask = 00000002, hint = 00000004
	msix index = 3, irq number = 358, cpu affinity mask = 00000001, hint = 00000008

	msix index = 0, irq number = 355, cpu affinity mask = 00000002, hint = 00000001
	msix index = 1, irq number = 356, cpu affinity mask = 00000008, hint = 00000002
	msix index = 2, irq number = 357, cpu affinity mask = 00000004, hint = 00000004
	msix index = 3, irq number = 358, cpu affinity mask = 00000001, hint = 00000008

I am expecting the mapping below, because when the <mpt3sas> driver sends
an IO it hints the FW about the completion queue. E.g. if an IO is
submitted from logical CPU #X, the driver uses smp_processor_id() to get
that logical CPU #X and expects completion on the same CPU for better
performance.  Is this expectation possible with the existing latest
<irqbalance> ?

	msix index = 0, irq number = 355, cpu affinity mask = 00000001, hint = 00000001
	msix index = 1, irq number = 356, cpu affinity mask = 00000002, hint = 00000002
	msix index = 2, irq number = 357, cpu affinity mask = 00000004, hint = 00000004
	msix index = 3, irq number = 358, cpu affinity mask = 00000008, hint = 00000008
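As a hedged sketch (not from the thread), the check for this 1:1 expectation can be scripted: compare the mask each MSI-X vector actually got against the driver's hint of 1 << index. The irq numbers 355..358 and the count of 4 vectors are taken from the output above; adjust for your system.

```shell
#!/bin/sh
# Sketch: compare each MSI-X vector's actual affinity against the 1:1
# expectation (vector i on CPU i, i.e. mask 1 << i).

expected_mask() {
    # msix index -> the hinted mask, i.e. 1 << index as 8 hex digits.
    printf '%08x' $(( 1 << $1 ))
}

first_irq=355    # from the output above; adjust for your adapter
i=0
while [ "$i" -lt 4 ]; do
    irq=$(( first_irq + i ))
    want=$(expected_mask "$i")
    if [ -f "/proc/irq/$irq/smp_affinity" ]; then
        got=$(cat "/proc/irq/$irq/smp_affinity")
        echo "irq $irq: want $want got $got"
    else
        echo "irq $irq: not present on this system"
    fi
    i=$(( i + 1 ))
done
```

A "want" != "got" line corresponds to the out-of-sequence masks shown in the earlier snippets.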

~ Kashyap

> Neil
-------------- next part --------------
A non-text attachment was scrubbed...
Name: irqbalance.debug
Type: application/octet-stream
Size: 50807 bytes
-------------- next part --------------
A non-text attachment was scrubbed...
Type: application/octet-stream
Size: 1440 bytes
