Re: Numa pinning best practices

> 
> 
> On Fri, Sep 13, 2024 at 12:16 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>> My sense is that with recent OS and kernel releases (e.g., not CentOS 8) irqbalance does a halfway decent job.
> 
> Strongly disagree! Canonical has actually disabled it by default in
> Ubuntu 24.04 and IIRC Debian already does, too:
> https://discourse.ubuntu.com/t/ubuntu-24-04-lts-noble-numbat-release-notes/39890#irqbalance-no-more-installed-and-enabled-by-default

Interesting.  The varied viewpoints of the Ceph community are invaluable.

Reading the above page, am I right to infer that recent kernels handle IRQ placement well enough by default now?

> While irqbalance _can_ do a decent job in some scenarios, it can also
> really mess things up. For something like Ceph where you are likely
> running a lot of the same platform(s) and are seeking predictability,
> you can probably do better controlling affinity yourself. At least,
> you should be able to do no worse.

Fair enough, I would love to do exactly that.
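
For my own notes, this is the minimal version of "doing it yourself" as I understand it, sketched in Python purely for illustration; the IRQ and CPU numbers are placeholders you would look up in /proc/interrupts first, and the write needs root:

#!/usr/bin/env python3
# Pin a single IRQ to a chosen CPU instead of letting irqbalance move it.
# Assumes root; some IRQs (e.g. per-CPU ones) will refuse the write.
import sys

def pin_irq(irq: int, cpu: int) -> None:
    # smp_affinity_list takes a human-readable CPU list, e.g. "4" or "8-11"
    with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
        f.write(str(cpu))

if __name__ == "__main__":
    irq, cpu = int(sys.argv[1]), int(sys.argv[2])
    pin_irq(irq, cpu)
    print(f"IRQ {irq} -> CPU {cpu}")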

> 
>>> I came across a recent Ceph Day NYC talk from Tyler Stachecki (Bloomberg) [1] and a Reddit post [2]. Apparently there is quite a bit of performance to gain when NUMA is optimally configured for Ceph.
>> 
>> My sense is that NUMA is very much a function of what CPUs one is using, and 1S vs 2S / 4S.  With 4S servers I've seen people using multiple NICs, multiple HBAs, etc., effectively partitioning into 4x 1S servers.  Why not save yourself hassle and just use 1S to begin with?  4+S-capable CPUs cost more and sometimes lag generationally.
> 
> Hey, that's me!

I first saw an elaborate 4S pinning scheme at an OpenStack Summit, 2016 or so.

> As Anthony says, YMMV based on your platform, what you use Ceph for
> (RBD?), and also how much Ceph you're running.
> 
> Early versions of Zen had quite bad core-to-core memory latency when
> you hopped across CCD/CCX.

There’s a graphic out there comparing those latencies for, IIRC, Ice Lake and Rome or Milan.

> There are some early warning signs in the Zen
> 5 client reviews that such latencies may be back to bite (I have not
> gotten my hands on one yet, nor have I seen anyone explain "why" yet):

Ouch.  Would one interpret this as Genoa being better?

> https://www.anandtech.com/show/21524/the-amd-ryzen-9-9950x-and-ryzen-9-9900x-review/3
> 
> In the diagram within that article you can clearly see the ~180ns
> difference, as well as the "striping" effect, when you cross a CCX.
> I'm wondering if this is a byproduct of the new ladder cache design
> within the Zen 5 CCX? Regardless: if you have latencies like this
> within a single socket, you likely stand to gain something by pinning
> processes to NUMA nodes even with 1P servers. The results mentioned in
> my presentation are all based on 1P platforms as well for comparison.

Which presentation?  I want to read through that carefully.  I’m about to deploy a bunch of 1S EPYC 9454 systems with 30TB SSDs for RBD, RGW, and perhaps later CephFS.  After clamoring for 1S systems for years I finally got my wish; now I want to optimize them as best I can, especially with 12x 30TB SSDs each (PCIe Gen 4, QLC and TLC) and bonded 100GbE.

In the past I inherited scripting that spread HBA and NIC interrupts across physical cores (every other hardware thread) and tweaked the CPU governor, but I have not dug into NVMe interrupts yet.
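
For concreteness, here is a rough sketch of that kind of IRQ spreading, written in Python only for illustration; the driver name pattern and core list are assumptions, not what the inherited scripts actually used:

#!/usr/bin/env python3
# Round-robin a device's interrupt vectors across physical cores (every
# other hardware thread on a 2-way SMT part).  The name pattern and core
# list are assumptions -- adjust for your NIC/HBA naming and CPU topology.
# Requires root.
import re

def irqs_for(pattern: str) -> list[int]:
    irqs = []
    with open("/proc/interrupts") as f:
        for line in f:
            fields = line.split()
            if fields and fields[0].rstrip(":").isdigit() and re.search(pattern, line):
                irqs.append(int(fields[0].rstrip(":")))
    return irqs

def spread(irqs: list[int], cpus: list[int]) -> None:
    for i, irq in enumerate(irqs):
        with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
            f.write(str(cpus[i % len(cpus)]))

if __name__ == "__main__":
    # e.g. cores 0-23 on a 24c/48t socket where threads 24-47 are SMT siblings
    physical_cores = list(range(24))
    spread(irqs_for(r"mlx5|mpt3sas"), physical_cores)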



> 
>>> So what is most optimal there? Does it still make sense to have the Ceph processes bound to the CPU where their respective NVMe resides when the network interface card is attached to another CPU / NUMA node? Or would this just result in more inter-NUMA traffic (latency) and negate any possible gains that could have been made?
> 
> I never benchmarked this, so I can only guess.
> 
> However: if you look at /proc/interrupts, you will see that most if
> not all enterprise NVMes in Linux effectively get allocated an MSI
> vector per thread per NVMe. Moreover, if you look at
> /proc/irq/<irq>/smp_affinity for each of those MSI vectors, you will
> see that they are each pinned to exactly one CPU thread.
> 
> In my experience, when NUMA pinning OSDs, only the MSI vectors local
> to the NUMA node where the OSD runs really have any activity. That
> seems optimal, so I've never had a reason to look any further.
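
That is exactly what I’d want to verify on the new boxes.  Here is a quick sketch of how I’d eyeball it; the paths are the standard procfs/sysfs locations, and the queue-vector naming is what I believe the Linux nvme driver registers:

#!/usr/bin/env python3
# For each NVMe controller, print the NUMA node of its PCIe function and
# the CPU(s) each of its MSI vectors is pinned to, straight from sysfs and
# /proc/interrupts.
import glob, os, re

def nvme_numa_node(ctrl: str) -> int:
    with open(f"/sys/class/nvme/{ctrl}/device/numa_node") as f:
        return int(f.read().strip())

def nvme_irq_affinities(ctrl: str) -> dict[int, str]:
    affin = {}
    with open("/proc/interrupts") as f:
        for line in f:
            if re.search(rf"\b{ctrl}q\d+\b", line):   # e.g. nvme0q1, nvme0q2, ...
                irq = int(line.split()[0].rstrip(":"))
                with open(f"/proc/irq/{irq}/smp_affinity_list") as a:
                    affin[irq] = a.read().strip()
    return affin

if __name__ == "__main__":
    for path in sorted(glob.glob("/sys/class/nvme/nvme[0-9]*")):
        ctrl = os.path.basename(path)
        print(ctrl, "is on NUMA node", nvme_numa_node(ctrl))
        for irq, cpus in nvme_irq_affinities(ctrl).items():
            print(f"  irq {irq} -> cpus {cpus}")
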
> 
>>> So the default policy seems to be active, and no Ceph NUMA affinity seems to have taken place. Can someone explain to me what Ceph (cephadm) is currently doing when the "osd_numa_auto_affinity" config setting is true and NUMA is exposed?
> 
> I, personally, am in the camp of folk who are not cephadm fans. What I
> did in my case was to write a shim that sits in front of the
> ceph-osd@.service unit, effectively overriding the default
> ExecStart=/usr/bin/ceph-osd.... and replacing it with
> ExecStart=/usr/local/bin/my_numa_shim /usr/bin/ceph-osd...
> 
> my_numa_shim is a tool that has some a priori knowledge of how the
> platforms are configured, and makes a decision about which NUMA node
> to use for a given OSD after probing which NUMA node is most local to
> the storage device associated with the OSD. It then sets the
> affinity/memory allocation mode of the process and does an execve to
> call /usr/bin/ceph-osd as systemd had originally intended. The pinning
> is not changed by the execve.

Is that tool available?
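
In the meantime, here is my rough mental model of such a shim, sketched in Python.  To be clear, this is not Tyler's tool: it guesses the OSD id from --id, assumes the OSD's block symlink resolves directly to an NVMe namespace (walking LVM/dm stacks is left out), and delegates the binding to numactl rather than setting the CPU/memory policy in-process before the execve:

#!/usr/bin/env python3
# NOT Tyler's tool -- just a sketch of the idea as described above.
# systemd would invoke this shim in place of ceph-osd, e.g.:
#   ExecStart=/usr/local/bin/my_numa_shim /usr/bin/ceph-osd -f --id %i ...
import os, sys

def osd_id_from_args(args: list[str]) -> str:
    # ceph-osd is normally started with "--id <N>"
    return args[args.index("--id") + 1]

def osd_block_device(osd_id: str) -> str:
    return os.path.realpath(f"/var/lib/ceph/osd/ceph-{osd_id}/block")

def numa_node_of(dev: str) -> int:
    # /sys/block/<ns>/device is the NVMe controller; its "device" in turn is
    # the PCI function, which carries the numa_node attribute (-1 = no info).
    name = os.path.basename(dev)
    with open(f"/sys/block/{name}/device/device/numa_node") as f:
        return max(int(f.read().strip()), 0)

if __name__ == "__main__":
    real_cmd = sys.argv[1:]   # /usr/bin/ceph-osd plus its original arguments
    node = numa_node_of(osd_block_device(osd_id_from_args(real_cmd)))
    os.execvp("numactl", ["numactl", f"--cpunodebind={node}",
                          f"--membind={node}"] + real_cmd)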

> 
> Would something similar work with cephadm? Probably, but offhand I
> have no idea how to implement it.
> 
> Cheers,
> Tyler

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



