irqbalancer subset policy and CPU lock up on storage controller.

nhorman@xxxxxxxxxx (Neil Horman) · Mon, 12 Oct 2015 16:20:07 -0400

On Tue, Oct 13, 2015 at 12:45:01AM +0530, Kashyap Desai wrote:
> > On Mon, Oct 12, 2015 at 11:52:30PM +0530, Kashyap Desai wrote:
> > > > > What should be the solution if we really want to slow down IO
> > > > > submission to avoid CPU lockup. We don't want only one CPU to keep
> > > > > busy for completion.
> > > > >
> > > > > Any suggestion ?
> > > > >
> > > > Yup, file a bug with Oracle :)
> > >
> > > Neil -
> > >
> > > Thanks for info. I understood to use latest <irqbalance>...that was
> > > already attempted. I tried with latest irqbalance and I see expected
> > > behavior as long as I provide <exact> or <subset> + <--policyscript>.
> > > We are planning for the same, but wanted to understand what is latest
> > > <irqbalancer> default settings. Is there any reason we are seeing
> > > default settings changed from  subset to ignore ?
> > >
> >
> > Latest defaults are that hinting is ignored by default, but hinting can
> > also be set via a policyscript on an irq by irq basis.
> >
> > The reasons for changing the default behavior are documented in commit
> > d9138c78c3e8cb286864509fc444ebb4484c3d70.  Irq affinity hinting is
> > effectively a holdover from back in the days when irqbalance couldn't
> > understand a device's locality and irq count easily.  Now that it can,
> > there is really no need for an irq affinity hint, unless your driver
> > doesn't properly participate in sysfs device enumeration.
> 
> Neil - I went through those details, but could not understand how <ignore>
> policy is useful. I may be missing something here. :-(
Yes, what you are missing is the fact that affinity hinting is an outdated
method of assigning irq affinity.  On any modern kernel it's not needed at all,
so the default policy is to ignore it.

> With <ignore> policy, mpt3sas driver on 32 logical CPU system has below
> affinity mask. As you said, driver hint is ignored.  That is understood as
> <ignore> is hinting for the same, but why affinity mask is just localized
> to local node (Node 0 in this case) ?
This has nothing to do with the ignore hint policy.  The reasons the below might
occur are:

1) the class of the device on the pci bus is such that irqbalance is deciding
that numa node is the level at which it should be balanced.  Currently there are
no such devices that get balanced at that level.  There are however package
level balanced devices, and if you have a single cpu package (with multiple
cores) on a single numa node, you might see this behavior. What is the pci class
of the mpt3sas adapter?

2) The interrupt controller on your system doesn't allow for user setting of
interrupt affinity.  I don't think that would be the case given that other
interrupts can be affined.  If you can manually set the affinity of these irqs
you can discount this possibility.

3) You are using a policyscript that assigns these affinities.  As I previously
requested, are you using a policy script and can you post it here?
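
For reference, a minimal sketch of what a policy script for point 3 could look
like.  It assumes the interface described in the irqbalance man page of this
era: the script is run once per irq with the sysfs device path and the irq
number as arguments, and its output is read as key=value pairs.  The
policy_lines() and driver_of() helpers and the choice to key on the mpt3sas
driver name are hypothetical, and the hintpolicy key should be checked against
the man page of the irqbalance version in use.

```python
#!/usr/bin/env python3
"""Hypothetical irqbalance policy script: pick a hint policy per irq."""
import os
import sys

# Assumption: only irqs owned by this driver should honor the driver's hint.
HINTED_DRIVER = "mpt3sas"


def policy_lines(driver_name, hinted_driver=HINTED_DRIVER):
    """Return the key=value lines to print for one irq."""
    if driver_name == hinted_driver:
        return ["hintpolicy=subset"]
    return ["hintpolicy=ignore"]


def driver_of(sysfs_path):
    """Best-effort driver name, read from the device's 'driver' symlink."""
    try:
        return os.path.basename(os.readlink(os.path.join(sysfs_path, "driver")))
    except OSError:
        # No driver bound, or the path does not exist.
        return ""


if __name__ == "__main__" and len(sys.argv) >= 3:
    # argv[1] is the sysfs device path, argv[2] is the irq number.
    device_path, irq = sys.argv[1], sys.argv[2]
    for line in policy_lines(driver_of(device_path)):
        print(line)
```

The script would be made executable and passed to irqbalance with
--policyscript.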

> What is confusing me is - "cpu affinity mask" is just localize to Numa
> Node-0  as PCI device enumeration detected pci device is local to
> numa_node 0.
I really don't know what you mean by this.  Yes, your masks seem to be following
what could be your numa node layout, but you're assuming (or it sounds like
you're assuming) that irqbalance is doing that intentionally.  It's not; one of
the above things is going on.
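
To check what the masks actually cover, something like the following sketch can
decode them.  It assumes the usual kernel cpumask text format (comma-separated
hex groups, most significant group first) as found in /proc/irq/<irq>/smp_affinity
and in the device's local_cpus file; the function names are made up for
illustration.

```python
def cpus_in_mask(mask_text):
    """Decode a cpumask string such as '0000ffff,00000000' into CPU numbers."""
    # Dropping the commas leaves one hex number; bit N set means CPU N.
    value = int(mask_text.strip().replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


def read_mask(path):
    """Read a cpumask file and return the CPU numbers it contains."""
    with open(path) as handle:
        return cpus_in_mask(handle.read())


def irq_stays_on_local_cpus(irq, pci_device_dir):
    """True if every CPU in the irq's affinity is local to the PCI device."""
    irq_cpus = set(read_mask(f"/proc/irq/{irq}/smp_affinity"))
    local_cpus = set(read_mask(f"{pci_device_dir}/local_cpus"))
    return irq_cpus <= local_cpus
```

If irq_stays_on_local_cpus() is true for every irq of the adapter, the masks
match the device's local cpus, but that alone does not say which of the three
causes above produced them.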

> 
> 
> When you say "Driver does not participate in sysfs enumeration" - Does it
> mean "numa_node" exposure in sysfs or anything more than that ? Sorry for
> basics and helping me to understand things.
> 
I mean, does your driver register itself as a pci device?  If so, it should have
a directory in sysfs in /sys/bus/pci/devices/<pci b:d.f>/.  As long as that
directory exists and is properly populated, irqbalance should have everything it
needs to properly assign a cpu to all of your irqs.  Note that the RHEL6 kernel
did not always properly populate that directory.  I added sysfs code to expose
needed irq information in the kernel, and if you have an older kernel and newer
irqbalance, that might be part of the problem - another reason to contact
Oracle.
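
A quick way to see whether the device's sysfs directory is populated is
sketched below.  The attribute list is an assumption about what upstream
irqbalance reads (class, numa_node, local_cpus, irq, msi_irqs), not an
exhaustive or authoritative list, and the device address in the usage line is
hypothetical.

```python
import os

# Assumption: attributes irqbalance consults under a PCI device directory.
EXPECTED_ATTRIBUTES = ("class", "numa_node", "local_cpus", "irq", "msi_irqs")


def missing_attributes(device_dir, expected=EXPECTED_ATTRIBUTES):
    """Return the expected sysfs entries that are absent from device_dir."""
    return [
        name
        for name in expected
        if not os.path.exists(os.path.join(device_dir, name))
    ]


if __name__ == "__main__":
    # Hypothetical address; substitute the adapter's own domain:bus:dev.fn.
    print(missing_attributes("/sys/bus/pci/devices/0000:03:00.0"))
```

An empty list means the entries exist; it does not verify their contents.
Note that msi_irqs is only present while the driver has MSI or MSI-X enabled.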

Another thing you can try is posting the output of irqbalance while running it
with -f and -d.  That will give us some insight as to what it's doing (note I'm
referring here to upstream irqbalance, not the old version).  And you still
didn't answer my question regarding the policyscript.

Neil