On Fri, Sep 13, 2024 at 12:16 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:

> My sense is that with recent OS and kernel releases (e.g., not CentOS 8) irqbalance does a halfway decent job.

Strongly disagree! Canonical has actually disabled it by default in Ubuntu 24.04, and IIRC Debian already does, too:

https://discourse.ubuntu.com/t/ubuntu-24-04-lts-noble-numbat-release-notes/39890#irqbalance-no-more-installed-and-enabled-by-default

While irqbalance _can_ do a decent job in some scenarios, it can also really mess things up. For something like Ceph, where you are likely running a lot of the same platform(s) and are seeking predictability, you can probably do better controlling affinity yourself. At the very least, you should be able to do no worse.

> > I came across a recent Ceph Day NYC talk from Tyler Stachecki (Bloomberg) [1] and a Reddit post [2]. Apparently there is quite a bit of performance to gain when NUMA is optimally configured for Ceph.

> My sense is that NUMA is very much a function of what CPUs one is using, and 1S vs 2S / 4S. With 4S servers I've seen people using multiple NICs, multiple HBAs, etc., effectively partitioning into 4x 1S servers. Why not save yourself the hassle and just use 1S to begin with? 4+S-capable CPUs cost more and sometimes lag generationally.

Hey, that's me! As Anthony says, YMMV based on your platform, what you use Ceph for (RBD?), and also how much Ceph you're running.

Early versions of Zen had quite bad core-to-core memory latency when you hopped across a CCD/CCX. There are some early warning signs in the Zen 5 client reviews that such latencies may be back to bite us (I have not gotten my hands on one yet, nor have I seen anyone explain "why" yet):

https://www.anandtech.com/show/21524/the-amd-ryzen-9-9950x-and-ryzen-9-9900x-review/3

In the diagram within that article you can clearly see the ~180ns difference, as well as the "striping" effect, when you cross a CCX.
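For the do-it-yourself route: IRQ affinity is just a CPU bitmask in procfs. A minimal sketch of the idea, where the IRQ and CPU numbers are hypothetical placeholders (not from this thread), and the actual writes are left commented out since they need root and a real IRQ:

```shell
#!/bin/sh
# Compute the hex affinity mask for pinning one (hypothetical) IRQ to one
# CPU thread, and show where it would be written. IRQ=42/CPU=2 are
# placeholders for illustration.
IRQ=42
CPU=2

# /proc/irq/<irq>/smp_affinity takes a hex CPU bitmask;
# /proc/irq/<irq>/smp_affinity_list takes a plain CPU list instead.
MASK=$(printf '%x' $((1 << CPU)))
echo "IRQ $IRQ -> CPU $CPU (mask 0x$MASK)"

# As root, on a real system, the pin itself would be:
#   echo "$MASK" > "/proc/irq/$IRQ/smp_affinity"
# or, equivalently:
#   echo "$CPU"  > "/proc/irq/$IRQ/smp_affinity_list"
```

Note you'd want irqbalance disabled first (`systemctl disable --now irqbalance`), or it will periodically rewrite the masks underneath you.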
I'm wondering if this is a byproduct of the new ladder cache design within the Zen 5 CCX?

Regardless: if you have latencies like this within a single socket, you likely stand to gain something by pinning processes to NUMA nodes even with 1P servers. For comparison, the results mentioned in my presentation are all based on 1P platforms as well.

> > So what is most optimal there? Does it still make sense to have the Ceph processes bound to the CPU where their respective NVMe resides when the network interface card is attached to another CPU / NUMA node? Or would this just result in more inter-NUMA traffic (latency) and negate any possible gains that could have been made?

I never benchmarked this, so I can only guess. However: if you look at /proc/interrupts, you will see that most if not all enterprise NVMes in Linux effectively get allocated an MSI vector per CPU thread per NVMe. Moreover, if you look at /proc/irq/<irq>/smp_affinity for each of those MSI vectors, you will see that they are each pinned to exactly one CPU thread. In my experience, when NUMA-pinning OSDs, only the MSI vectors local to the NUMA node where the OSD runs show any real activity. That seems optimal, so I've never had a reason to look any further.

> > So the default policy seems to be active, and no Ceph NUMA affinity seems to have taken place. Can someone explain to me what Ceph (cephadm) is currently doing when the "osd_numa_auto_affinity" config setting is true and NUMA is exposed?

I, personally, am in the camp of folk who are not cephadm fans. What I did in my case was to write a shim that sits in front of the ceph-osd@.service unit, effectively overriding the default

    ExecStart=/usr/bin/ceph-osd...

and replacing it with

    ExecStart=/usr/local/bin/my_numa_shim /usr/bin/ceph-osd...
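For a rough idea of the shape such a shim can take, here is an illustrative sketch, not Tyler's actual tool: the NUMA_DEV variable is an invented convention standing in for the shim's a-priori platform knowledge, and SYSROOT is only a knob for testing against a mock sysfs tree. It probes which NUMA node is local to the OSD's block device via sysfs, then exec's the wrapped command under numactl so the pinning survives into ceph-osd:

```shell
#!/bin/sh
# Sketch of a NUMA shim in the spirit of my_numa_shim (illustrative, not
# the real tool). Intended use from a systemd unit override:
#   ExecStart=/usr/local/bin/my_numa_shim /usr/bin/ceph-osd...

# Print the NUMA node local to a block device via sysfs, or -1 if
# unknown. $1 = device name, e.g. nvme0n1. SYSROOT is a test-only prefix.
numa_node_of() {
    cat "${SYSROOT:-}/sys/block/$1/device/numa_node" 2>/dev/null || echo -1
}

# Exec the wrapped command pinned to the node local to device $NUMA_DEV
# (an assumed convention; a real shim would infer the OSD's own device).
numa_exec() {
    node=$(numa_node_of "${NUMA_DEV:-nvme0n1}")
    if [ "$node" -ge 0 ] && command -v numactl >/dev/null; then
        # Bind both CPU scheduling and memory allocation to the local
        # node, then exec so systemd still tracks ceph-osd directly.
        exec numactl --cpunodebind="$node" --membind="$node" "$@"
    fi
    # No NUMA info (or no numactl): run unpinned rather than fail.
    exec "$@"
}

# When installed as the shim, run the wrapped command immediately.
if [ $# -gt 0 ]; then numa_exec "$@"; fi
```

The exec (rather than fork) is the important part: systemd's ExecStart= PID ends up being ceph-osd itself, and the affinity set by numactl carries across the execve.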
my_numa_shim is a tool which has some a priori knowledge of how the platforms are configured, and makes a decision about which NUMA node to use for a given OSD after probing which NUMA node is most local to the storage device associated with the OSD. It then sets the affinity/memory allocation mode of the process and does an execve() to call /usr/bin/ceph-osd as systemd had originally intended. The pinning is not changed by the execve().

Would something similar work with cephadm? Probably, but offhand I have no idea how to implement it.

Cheers,
Tyler
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx