On Tue, Oct 13, 2015 at 12:45:01AM +0530, Kashyap Desai wrote:
> > On Mon, Oct 12, 2015 at 11:52:30PM +0530, Kashyap Desai wrote:
> > > > > What should be the solution if we really want to slow down IO
> > > > > submission to avoid CPU lockup. We don't want only one CPU to keep
> > > > > busy for completion.
> > > > >
> > > > > Any suggestion ?
> > > > >
> > > > Yup, file a bug with Oracle :)
> > >
> > > Neil -
> > >
> > > Thanks for info. I understood to use latest <irqbalance>...that was
> > > already attempted. I tried with latest irqbalance and I see expected
> > > behavior as long as I provide <exact> or <subset> + <--policyscript>.
> > > We are planning for the same, but wanted to understand what is latest
> > > <irqbalancer> default settings. Is there any reason we are seeing
> > > default settings changed from subset to ignore ?
> > >
> >
> > Latest defaults are that hinting is ignored by default, but hinting can
> > also be set via a policyscript on an irq by irq basis.
> >
> > The reasons for changing the default behavior are documented in commit
> > d9138c78c3e8cb286864509fc444ebb4484c3d70. Irq affinity hinting is
> > effectively a holdover from back in the days when irqbalance couldn't
> > understand a device's locality and irq count easily. Now that it can,
> > there is really no need for an irq affinity hint, unless your driver
> > doesn't properly participate in sysfs device enumeration.
>
> Neil - I went through those details, but could not understand how <ignore>
> policy is useful. I may be missing something here. :-(

Yes, what you are missing is the fact that affinity hinting is an outdated
method of assigning irq affinity. On any modern kernel it's not needed at
all, so the default policy is to ignore it.

> With <ignore> policy, mpt3sas driver on 32 logical CPU system has below
> affinity mask. As you said, driver hint is ignored.
> That is understood as <ignore> is hinting for the same, but why affinity
> mask is just localized to local node (Node 0 in this case) ?

This has nothing to do with ignoring hint policy. The reasons the below
might occur are:

1) The class of the device on the pci bus is such that irqbalance is
deciding that numa node is the level at which it should be balanced.
Currently there are no such devices that get balanced at that level. There
are however package level balanced devices, and if you have a single cpu
package (with multiple cores) on a single numa node, you might see this
behavior. What is the pci class of the mpt3sas adapter?

2) The interrupt controller on your system doesn't allow for user setting
of interrupt affinity. I don't think that would be the case given that
other interrupts can be affined. If you can manually set the affinity of
these irqs you can discount this possibility.

3) You are using a policyscript that assigns these affinities. As I
previously requested, are you using a policy script and can you post it
here?

> What is confusing me is - "cpu affinity mask" is just localize to Numa
> Node-0 as PCI device enumeration detected pci device is local to
> numa_node 0.

I really don't know what you mean by this. Yes, your masks seem to be
following what could be your numa node layout, but you're assuming (or it
sounds like you're assuming) that irqbalance is doing that intentionally.
It's not; one of the above things is going on.

> When you say "Driver does not participate in sysfs enumeration" - Does it
> mean "numa_node" exposure in sysfs or anything more than that ? Sorry for
> basics and helping me to understand things.
>

I mean, does your driver register itself as a pci device? If so, it should
have a directory in sysfs in /sys/bus/pci/devices/<pci b:d:f>/. As long as
that directory exists and is properly populated, irqbalance should have
everything it needs to properly assign a cpu to all of your irqs.
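
To make the sysfs check and the affinity masks discussed above concrete,
here is a rough Python sketch. It is only an illustration, not what
irqbalance does internally. The helper names (cpus_from_mask,
describe_pci_device, irq_affinity) are made up for this example, and it
assumes the usual class, numa_node, local_cpulist and msi_irqs entries in
the device directory plus /proc/irq/<n>/smp_affinity, any of which may be
missing on older kernels.

```python
import os


def cpus_from_mask(mask):
    """Decode an smp_affinity style hex mask into a sorted list of CPUs.

    The kernel prints the mask as comma-separated 32-bit hex words, most
    significant word first, e.g. "00000000,0000ffff".
    """
    value = int(mask.strip().replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


def describe_pci_device(bdf, root="/sys/bus/pci/devices"):
    """Collect a few sysfs attributes of one pci device (None if missing)."""
    dev_dir = os.path.join(root, bdf)
    info = {"exists": os.path.isdir(dev_dir)}
    for attr in ("class", "numa_node", "local_cpulist"):
        try:
            with open(os.path.join(dev_dir, attr)) as f:
                info[attr] = f.read().strip()
        except OSError:
            info[attr] = None
    # msi_irqs has one entry per MSI/MSI-X vector on kernels that expose it.
    try:
        names = os.listdir(os.path.join(dev_dir, "msi_irqs"))
        info["msi_irqs"] = sorted(int(n) for n in names)
    except OSError:
        info["msi_irqs"] = None
    return info


def irq_affinity(irq, root="/proc/irq"):
    """Return the list of CPUs an irq is currently allowed to run on."""
    with open(os.path.join(root, str(irq), "smp_affinity")) as f:
        return cpus_from_mask(f.read())
```

For example, cpus_from_mask("00000000,0000ffff") decodes to CPUs 0 through
15. Calling describe_pci_device() with the adapter's b:d:f and then
irq_affinity() on each listed irq shows whether the directory is populated
and where each vector currently sits.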

Note that the RHEL6 kernel did not always properly populate that directory.
I added sysfs code to expose needed irq information in the kernel, and if
you have an older kernel and newer irqbalance, that might be part of the
problem - another reason to contact Oracle.

Another thing you can try is posting the output of irqbalance while running
it with -f and -d. That will give us some insight as to what it's doing
(note I'm referring here to upstream irqbalance, not the old version).

And you still didn't answer my question regarding the policyscript.

Neil