IRQ balancing, distribution

chibi@xxxxxxx (Christian Balzer) · Mon, 22 Sep 2014 17:21:41 +0900

Hello,

On Mon, 22 Sep 2014 09:35:10 +0200 Stijn De Weirdt wrote:

> hi christian,
> 
> we once were debugging some performance issues, and IRQ balancing was
> one of the issues we looked into, but no real benefit there for us.
> all interrupts on one cpu is only an issue if the hardware itself is not 
> the bottleneck. 
In particular the spinning rust. ^o^
But this crept up in recent discussions about all-SSD OSD storage servers,
so there is some (remote) possibility for this to happen.

> we were running some default SAS HBA (Dell H200), and
> those simply can't generate enough load to cause any IRQ issue even on
> older AMD cpus (we did tests on R515 boxes). (there was a ceph
> presentation somewhere that highlights the impact of using the proper
> disk controller; we'll have to fix that first in our case.) i'll be
> happy if IRQ balancing actually becomes an issue ;)
> 
Yeah, this pretty much matches what I'm seeing and experienced over the
years.

> but another issue is the OSD processes: do you pin those as well? and 
> how much data do they actually handle. to checksum, the OSD process 
> needs all data, so that can also cause a lot of NUMA traffic, esp if 
> they are not pinned.
> 
That's why all my (production) storage nodes have only a single 6 or 8
core CPU. Unfortunately that also limits the amount of RAM in there; 16GB
modules have only recently become an economically viable alternative to
8GB ones.

Thus I don't pin OSD processes, given that on my 8 core nodes with 8 OSDs
and 4 journal SSDs I can make Ceph eat babies and nearly all CPU (not
IOwait!) resources with the right (or is that wrong?) tests, namely 4K
fio runs.

The Linux scheduler usually is quite decent at keeping processes where the
action is, thus you see for example a clear preference of DRBD or KVM vnet
processes to be "near" or on the CPU(s) where the IRQs are.

> i sort of hope that current CPUs have enough pcie lanes and cores so we 
> can use single socket nodes, to avoid at least the NUMA traffic.
> 
Even the lackluster Opterons with just PCIe v2 and fewer lanes than current
Intel CPUs are plenty fast (sufficient bandwidth) when it comes to
the storage node density I'm deploying.

Christian
> stijn
> 
> > not really specific to Ceph, but since one of the default questions by
> > the Ceph team when people are facing performance problems seems to be
> > "Have you tried turning it off and on again?" ^o^ err,
> > "Are all your interrupts on one CPU?"
> > I'm going to wax on about this for a bit and hope for some feedback
> > from others with different experiences and architectures than me.
> >
> > Now firstly, the question of whether all your IRQ handling is happening
> > on the same CPU is a valid one, as depending on a bewildering range of factors
> > ranging from kernel parameters to actual hardware one often does indeed
> > wind up with that scenario, usually with all on CPU0.
> > Which certainly is the case with all my recent hardware and Debian
> > kernels.
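
For anyone wanting to check the "all on CPU0" situation on their own boxes,
here is a minimal sketch (not something from the original thread) that sums
the per-CPU columns of /proc/interrupts. The function name irq_totals and
the sample text are made up for illustration; on a real machine you would
pass in the contents of /proc/interrupts instead.

```python
# Hypothetical sample in the layout of /proc/interrupts (values invented).
SAMPLE = """
           CPU0       CPU1
  0:         30          0   IO-APIC-edge      timer
 44:        900        100   PCI-MSI-edge      ahci
ERR:          0
"""


def irq_totals(text):
    """Return {cpu_name: total interrupt count} from /proc/interrupts text."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()              # header row: CPU0 CPU1 ...
    totals = [0] * len(cpus)
    for line in lines[1:]:
        fields = line.split()
        if not fields or not fields[0].endswith(":"):
            continue
        counts = fields[1:1 + len(cpus)]
        # Rows such as ERR: carry a single global counter; skip anything
        # that does not have one numeric column per CPU.
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue
        for i, c in enumerate(counts):
            totals[i] += int(c)
    return dict(zip(cpus, totals))


totals = irq_totals(SAMPLE)
grand = sum(totals.values()) or 1
for cpu, count in totals.items():
    print(f"{cpu}: {count} interrupts ({100.0 * count / grand:.1f}% of total)")
```

Note that these are cumulative counts since boot, so to see the current
distribution you would take two readings and compare the difference.
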
> >
> > I'm using nearly exclusively AMD CPUs (Opteron 42xx, 43xx and 63xx) and
> > thus feedback from Intel users is very much sought after, as I'm
> > considering Intel based storage nodes in the future.
> > It's vaguely amusing that Ceph storage nodes seem to have higher CPU
> > (individual core performance, not necessarily # of cores) and similar
> > RAM requirements compared to my VM hosts. ^o^
> >
> > So the common wisdom is that all IRQs on one CPU is a bad thing, lest
> > it gets overloaded and, for example, drops network packets because of
> > this. And while that is true, I'm hard pressed to generate any load on
> > my clusters where the IRQ ratio on CPU0 goes much beyond 50%.
> >
> > Thus it should come as no surprise that spreading out IRQs with
> > irqbalance or more accurately by manually setting
> > the /proc/irq/xx/smp_affinity mask doesn't give me any discernible
> > differences when it comes to benchmark results.
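
Since the smp_affinity mask tends to trip people up: it is a hexadecimal
bitmask with bit N standing for CPU N, written in comma-separated 32-bit
groups on larger machines. A small sketch of the conversion, with
cpus_to_mask being a helper name I made up for this example:

```python
def cpus_to_mask(cpus):
    """Turn a list of CPU numbers into an smp_affinity style hex mask."""
    bits = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        bits |= 1 << cpu                 # bit N selects CPU N
    hexstr = format(bits, "x")
    # Split into groups of 8 hex digits (32 bits), most significant first.
    groups = []
    while hexstr:
        groups.insert(0, hexstr[-8:])
        hexstr = hexstr[:-8]
    return ",".join(groups)


for cpus in ([0], [1], [0, 1], [32]):
    print(cpus, "->", cpus_to_mask(cpus))
```

The resulting string is what gets written (as root) to
/proc/irq/xx/smp_affinity. A running irqbalance may rewrite the mask again
later, so manual settings only stick if it is stopped or told to leave that
IRQ alone.
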
> >
> > With irqbalance spreading things out willy-nilly w/o any regard for or
> > knowledge of the hardware and what IRQ does what, it's definitely
> > something I won't be using out of the box. This goes especially for
> > systems with different NUMA regions without proper policy scripts for
> > irqbalance.
> >
> > So for my current hardware I'm going to keep IRQs on CPU0 and CPU1,
> > which are on the same Bulldozer module and thus share L2 and L3 cache.
> > In particular the AHCI (journal SSDs) and HBA or RAID controller IRQs
> > on CPU0 and the network (Infiniband) on CPU1.
> > That should give me sufficient reserves in processing power and keep
> > inter-core (module) and NUMA (additional physical CPUs) traffic to a
> > minimum. This also will (within a certain load range) allow these 2
> > CPUs (module) to be ramped up to full speed while other cores can
> > remain at a lower frequency.
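
To make the layout above concrete, here is a rough dry-run sketch that maps
IRQs to CPUs by matching the device name in /proc/interrupts and only
reports what would be written. The name patterns (ahci, mpt2sas, mlx4), the
sample text and the plan_affinity helper are assumptions for illustration;
the actual names depend on the drivers in use.

```python
# Hypothetical sample in the layout of /proc/interrupts (values invented).
SAMPLE = """
           CPU0       CPU1
 44:        900        100   PCI-MSI-edge      ahci
 45:        500          0   PCI-MSI-edge      mpt2sas0-msix0
 46:        700          0   PCI-MSI-edge      mlx4-comp-0@pci:0000:02:00.0
"""

# (substring of the device name, target CPU): storage on CPU0, network on CPU1.
RULES = [("ahci", 0), ("mpt2sas", 0), ("mlx4", 1)]


def plan_affinity(interrupts_text, rules):
    """Return [(irq_number, hex_mask)] for every IRQ matching a rule."""
    plan = []
    for line in interrupts_text.splitlines():
        irq, sep, rest = line.strip().partition(":")
        if not sep or not irq.isdigit():
            continue                     # header row or named rows (NMI, ERR)
        for pattern, cpu in rules:
            if pattern in rest:
                plan.append((int(irq), format(1 << cpu, "x")))
                break                    # first matching rule wins
    return plan


for irq, mask in plan_affinity(SAMPLE, RULES):
    # Dry run: print the target file and mask instead of writing to /proc.
    print(f"/proc/irq/{irq}/smp_affinity <- {mask}")
```

Printing rather than writing keeps the sketch harmless; applying the plan
would mean writing each mask to the listed file as root.
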
> >
> > Now with Intel some PCIe lanes are handled by a specific CPU (that's
> > why you often see the need for adding a 2nd CPU to use all slots) and
> > in that case pinning the IRQ handling for those slots on a specific
> > CPU might actually make a lot of sense. Especially if not all the
> > traffic generated by that card will have to be transferred to the other
> > CPU anyway.
> >
> >
> > Christian
> >

-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/