IRQ balancing, distribution

Hello,

On Mon, 22 Sep 2014 08:55:48 -0500 Mark Nelson wrote:

> On 09/22/2014 01:55 AM, Christian Balzer wrote:
> >
> > Hello,
> >
> > not really specific to Ceph, but since one of the default questions by
> > the Ceph team when people are facing performance problems seems to be
> > "Have you tried turning it off and on again?" ^o^ err,
> > "Are all your interrupts on one CPU?"
> > I'm going to wax on about this for a bit and hope for some feedback
> > from others with different experiences and architectures than me.
> 
> This may be a result of me harping about this after a customer's
> clusters had mysterious performance issues where irqbalance didn't
> appear to be working properly. :)
> 
> >
> > Now firstly that question if all your IRQ handling is happening on the
> > same CPU is a valid one, as depending on a bewildering range of factors
> > ranging from kernel parameters to actual hardware one often does indeed
> > wind up with that scenario, usually with all on CPU0.
> > Which certainly is the case with all my recent hardware and Debian
> > kernels.
> 
> Yes, there are certainly a lot of scenarios where this can happen.  I 
> think the hope has been that with MSI-X, interrupts will get evenly 
> distributed by default and that is typically better than throwing them 
> all at core 0, but things are still quite complicated.
> 
> >
> > I'm using nearly exclusively AMD CPUs (Opteron 42xx, 43xx and 63xx) and
> > thus feedback from Intel users is very much sought after, as I'm
> > considering Intel based storage nodes in the future.
> > It's vaguely amusing that Ceph storage nodes seem to have higher CPU
> > requirements (individual core performance, not necessarily # of cores)
> > than my VM hosts, and similar RAM requirements. ^o^
> 
> It might be reasonable to say that Ceph is a pretty intensive piece of 
> software.  With lots of OSDs on a system there are hundreds if not 
> thousands of threads.  Under heavy load conditions the CPUs, network 
> cards, HBAs, memory, socket interconnects, possibly SAS expanders are 
> all getting worked pretty hard and possibly in unusual ways where both 
> throughput and latency are important.  At the cluster scale things like 
> switch bisection bandwidth and network topology become issues too.  High 
> performance clustered storage is imho one of the most complicated 
> performance subjects in computing.
> 
Nobody will argue with that. ^.^

> The good news is that much of this can be avoided by sticking to simple 
> designs with fewer OSDs per node.  The more OSDs you try to stick in 1 
> system, the more you need to worry about all of this if you care about 
> high performance.
> 
I'd say that 8 OSDs isn't exactly dense (my case), but the advantages
of less densely populated nodes come with a significant price tag in
rack space and hardware costs.

> >
> > So the common wisdom is that all IRQs on one CPU is a bad thing, lest
> > it get overloaded and, for example, drop network packets as a result.
> > And while that is true, I'm hard pressed to generate any load on
> > my clusters where the IRQ ratio on CPU0 goes much beyond 50%.
> >
> > Thus it should come as no surprise that spreading out IRQs with
> > irqbalance or more accurately by manually setting
> > the /proc/irq/xx/smp_affinity mask doesn't give me any discernible
> > differences when it comes to benchmark results.
> 
> Ok, that's fine, but this is pretty subjective.  Without knowing the 
> load and the hardware setup I don't think we can really draw any 
> conclusions other than that in your test on your hardware this wasn't 
> the bottleneck.
> 
Of course, I can only realistically talk about what I have tested and thus
invited feedback from others. 
I can certainly see situations where this could be an issue with Ceph and
do have experience with VM hosts that benefited from spreading IRQ
handling over more than one CPU. 

What I'm trying to get across is that people shouldn't fall into a
cargo-cult trap and should think about and examine things for themselves,
as blindly turning on indiscriminate IRQ balancing might do more harm than
good in certain scenarios.
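
If you want to check this for yourself, the quick and dirty way is to look
at /proc/interrupts. Below is a small Python sketch of that check
(illustrative only; it just sums the per-CPU counters since boot and
assumes the standard Linux /proc/interrupts layout):
---
#!/usr/bin/env python3
# Sum /proc/interrupts per CPU to see whether IRQ handling piles up on
# one core (e.g. everything on CPU0). Read-only, safe to run anywhere.

def per_cpu_totals(path="/proc/interrupts"):
    with open(path) as f:
        cpus = f.readline().split()          # header line: CPU0 CPU1 ...
        totals = [0] * len(cpus)
        for line in f:
            fields = line.split()
            # per-CPU counters follow the "NN:" label; ERR/MIS lines are shorter
            for i, val in enumerate(fields[1:1 + len(cpus)]):
                if val.isdigit():
                    totals[i] += int(val)
    return dict(zip(cpus, totals))

if __name__ == "__main__":
    totals = per_cpu_totals()
    grand = sum(totals.values()) or 1
    for cpu, count in totals.items():
        print(f"{cpu}: {count} ({100.0 * count / grand:.1f}%)")
---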

> >
> > With irqbalance spreading things out willy-nilly w/o any regard for or
> > knowledge of the hardware and what IRQ does what, it's definitely
> > something I won't be using out of the box. This goes especially for
> > systems with different NUMA regions without proper policy scripts for
> > irqbalance.
> 
> I believe irqbalance takes PCI topology into account when making mapping 
> decisions.  See:
> 
> http://dcs.nac.uci.edu/support/sysadmin/security/archive/msg09707.html
> 

I'm sure it tries to do the right thing and it gets at least some things
right, like working out what my system (a single Opteron 4386) looks like:
---
Package 0:  numa_node is 0 cpu mask is 000000ff (load 0)
        Cache domain 0:  numa_node is 0 cpu mask is 00000003  (load 0) 
                CPU number 0  numa_node is 0 (load 0)
                CPU number 1  numa_node is 0 (load 0)
        Cache domain 1:  numa_node is 0 cpu mask is 0000000c  (load 0) 
                CPU number 2  numa_node is 0 (load 0)
                CPU number 3  numa_node is 0 (load 0)
        Cache domain 2:  numa_node is 0 cpu mask is 00000030  (load 0) 
                CPU number 4  numa_node is 0 (load 0)
                CPU number 5  numa_node is 0 (load 0)
        Cache domain 3:  numa_node is 0 cpu mask is 000000c0  (load 0) 
                CPU number 6  numa_node is 0 (load 0)
                CPU number 7  numa_node is 0 (load 0)
---

It also kinda sorta figures out the onboard ethernet (network) as well
as the AHCI and LSI HBA interrupts (storage).
However it fails to see the Mellanox IB HCA as a network device and does
inane things like putting eth0-rx-0 and eth0-tx-0 in different cache
domains.
Combined with it constantly moving IRQs around based on load (and thus
probably fighting the scheduler to some degree), I really don't think I
will ever be running this in the background anywhere.
While intra-CPU migration is cheap compared to page migration from another
NUMA region, it is not free either.
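
For reference, this is roughly how I look at where each IRQ is currently
allowed to run, e.g. to spot eth0-rx-0 and eth0-tx-0 ending up in different
cache domains, or to watch irqbalance shuffle things around between two
runs. A rough Python sketch, assuming /proc/irq/<n>/smp_affinity_list is
available (it is on any reasonably recent kernel):
---
#!/usr/bin/env python3
# Print each IRQ, its current CPU affinity and the device name behind it.

import glob
import os

def irq_names(path="/proc/interrupts"):
    names = {}
    with open(path) as f:
        ncpu = len(f.readline().split())     # number of per-CPU columns
        for line in f:
            fields = line.split()
            irq = fields[0].rstrip(":")
            if irq.isdigit():
                # everything after the counters is chip / type / device name
                names[irq] = " ".join(fields[1 + ncpu:])
    return names

if __name__ == "__main__":
    names = irq_names()
    for path in sorted(glob.glob("/proc/irq/*/smp_affinity_list"),
                       key=lambda p: int(p.split(os.sep)[3])):
        irq = path.split(os.sep)[3]
        with open(path) as f:
            cpus = f.read().strip()
        print(f"IRQ {irq:>4}  cpus {cpus:<10}  {names.get(irq, '')}")
---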

On a dual Opteron 6378 it wrongly split the 2 CPUs into 2 NUMA groups,
whereas the kernel got it right: there are 4 NUMA nodes (of sorts, as
the L3 cache is shared per 8 cores):
---
x86: Booting SMP configuration:
.... node  #0, CPUs:        #1  #2  #3  #4  #5  #6  #7
.... node  #1, CPUs:    #8  #9 #10 #11 #12 #13 #14 #15
.... node  #2, CPUs:   #16 #17 #18 #19 #20 #21 #22 #23
.... node  #3, CPUs:   #24 #25 #26 #27 #28 #29 #30 #31
x86: Booted up 4 nodes, 32 CPUs
---
In addition, irqbalance distributed the 8 MSI-X interrupts for eth0 over
all NUMA regions, while heaping everything else on CPU24...

Again, w/o any real pressure on the IRQ handling front I'd rather assign
things manually and statically (or not at all) and let the scheduler handle
the rest.
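
To illustrate what I mean by assigning things manually and statically,
here's a rough Python sketch of the idea. The device-name patterns and CPU
masks below are purely illustrative and need to be adapted to your own
controllers; it needs root, and irqbalance must not be running or it will
simply move things again:
---
#!/usr/bin/env python3
# Pin the IRQs of selected devices to fixed CPU masks and leave them there.

# device-name substring -> hex CPU mask (CPU0 = 1, CPU1 = 2, CPU0+1 = 3, ...)
# These names/masks are examples only; check /proc/interrupts for yours.
IRQ_PINS = {
    "ahci":    "1",   # journal SSDs on CPU0
    "mpt2sas": "1",   # LSI HBA on CPU0
    "mlx4":    "2",   # Infiniband HCA on CPU1
}

def pin_irqs(pins, table="/proc/interrupts"):
    with open(table) as f:
        f.readline()                         # skip the CPU header line
        for line in f:
            fields = line.split()
            irq = fields[0].rstrip(":")
            if not irq.isdigit():
                continue
            desc = " ".join(fields[1:])
            for pattern, mask in pins.items():
                if pattern in desc:
                    try:
                        with open(f"/proc/irq/{irq}/smp_affinity", "w") as aff:
                            aff.write(mask)
                        print(f"IRQ {irq} ({pattern}) -> mask {mask}")
                    except OSError as err:
                        print(f"IRQ {irq}: could not set affinity ({err})")

if __name__ == "__main__":
    pin_irqs(IRQ_PINS)
---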


> >
> > So for my current hardware I'm going to keep IRQs on CPU0 and CPU1
> > which are the same Bulldozer module and thus sharing L2 and L3 cache.
> > In particular the AHCI (journal SSDs) and HBA or RAID controller IRQs
> > on CPU0 and the network (Infiniband) on CPU1.
> > That should give me sufficient reserves in processing power and keep
> > intra core (module) and NUMA (additional physical CPUs) traffic to a
> > minimum. This also will (within a certain load range) allow these 2
> > CPUs (module) to be ramped up to full speed while other cores can
> > remain at a lower frequency.
> 
> So it's been a while since I looked at AMD CPU interconnect topology, 
> but back in the magnycours era I drew up some diagrams:
> 
> 2 socket:
> 
> https://docs.google.com/drawings/d/1_egexLqN14k9bhoN2nkv3iTgAbbPcwuwJmhwWAmakwo/edit?usp=sharing
> 
> 4 socket:
> 
> https://docs.google.com/drawings/d/1V5sFSInKq3uuKRbETx1LVOURyYQF_9Z4zElPrl1YIrw/edit?usp=sharing
> 
> I think Interlagos looks somewhat similar from a hypertransport 
> perspective.  My gut instinct is that you really want to keep 
> everything you can local to the socket on these kinds of systems.  So if 
> your HBA is on the first socket, you want your processing and interrupt 
> handling there too.  In the 4-socket configuration this is especially 
> true.  It's entirely possible that you may have to go through both an 
> on-die and an inter-socket HT link before you get to a neighbour CPU. 
> With the 2-socket configuration it's not quite as bad.
> 
> Intel CPUs in some ways are nicer because you have fewer cores that are 
> faster and often have much more straightforward interconnect topologies 
> (though at the high-end sometimes bizarre tradeoffs get made for memory 
> like "flexmem bridges" and such.)  Better to just stick with a simpler 
> and straightforward architecture imho.
> 

As I said, I'm keeping it to single-CPU deployments for the time being
and thus deploying fast 43xx Opterons, which are more than adequate for
the job with the given hardware/density and quite a bit cheaper than Intel.

For denser or faster (SSD) storage nodes I'm looking at Intel CPUs, given
the appetite of Ceph for CPU cycles.

> >
> > Now with Intel some PCIe lanes are handled by a specific CPU (that's
> > why you often see the need for adding a 2nd CPU to use all slots) and
> > in that case pinning the IRQ handling for those slots on a specific
> > CPU might actually make a lot of sense. Especially if not all the
> > traffic generated by that card will have to be transferred to the other
> > CPU anyway.
> 
> You need to think about that on just about any multi-socket system 
> except possibly those that have full-throughput links to an external IO 
> HUB from every socket.
> 
Of course, but with AMD you always wind up with all the I/O on CPU0, so
no brains required: keep your IRQ handling there. ^o^
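
(For what it's worth, sysfs will tell you which NUMA node a given card
actually hangs off, which is the thing to check before pinning anything on
a multi-socket box. A quick Python sketch, assuming the usual
/sys/bus/pci layout:)
---
#!/usr/bin/env python3
# List storage and network PCI devices with the NUMA node they report.
# -1 means the platform doesn't associate the device with a node.

import glob
import os

def pci_numa_nodes():
    for dev in sorted(glob.glob("/sys/bus/pci/devices/*")):
        def read(name):
            try:
                with open(os.path.join(dev, name)) as f:
                    return f.read().strip()
            except OSError:
                return "?"
        cls = read("class")
        # 0x01xxxx = storage controllers, 0x02xxxx = network controllers
        if cls.startswith(("0x01", "0x02")):
            print(f"{os.path.basename(dev)}  class {cls}  numa_node {read('numa_node')}")

if __name__ == "__main__":
    pci_numa_nodes()
---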

Christian

> >
> >
> > Christian
> >
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

