On Fri, Sep 13, 2024 at 12:16 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:

> My sense is that with recent OS and kernel releases (e.g., not CentOS 8) irqbalance does a halfway decent job.

Strongly disagree! Canonical has actually disabled it by default in Ubuntu 24.04, and IIRC Debian already does, too:

https://discourse.ubuntu.com/t/ubuntu-24-04-lts-noble-numbat-release-notes/39890#irqbalance-no-more-installed-and-enabled-by-default

While irqbalance _can_ do a decent job in some scenarios, it can also really mess things up. For something like Ceph, where you are likely running a lot of the same platform(s) and are seeking predictability, you can probably do better controlling affinity yourself. At the very least, you should be able to do no worse.

> > I came across a recent Ceph Day NYC talk from Tyler Stachecki (Bloomberg) [1] and a Reddit post [2]. Apparently there is quite a bit of performance to gain when NUMA is optimally configured for Ceph.

> My sense is that NUMA is very much a function of what CPUs one is using, and 1S vs 2S / 4S. With 4S servers I've seen people using multiple NICs, multiple HBAs, etc., effectively partitioning into 4x 1S servers. Why not save yourself the hassle and just use 1S to begin with? 4+S-capable CPUs cost more and sometimes lag generationally.

Hey, that's me! As Anthony says, YMMV based on your platform, what you use Ceph for (RBD?), and also how much Ceph you're running.

Early versions of Zen had quite bad core-to-core memory latency when you hopped across a CCD/CCX. There are some early warning signs in the Zen 5 client reviews that such latencies may be back to bite us (I have not gotten my hands on one yet, nor have I seen anyone explain "why" yet):

https://www.anandtech.com/show/21524/the-amd-ryzen-9-9950x-and-ryzen-9-9900x-review/3

In the diagram within that article you can clearly see the ~180ns difference, as well as the "striping" effect, when you cross a CCX.
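For the do-it-yourself route: IRQ affinity is just a CPU bitmask in procfs. A minimal sketch of the idea, where the IRQ and CPU numbers are hypothetical placeholders (not from this thread), and the actual writes are left commented out since they need root and a real IRQ:

```shell
#!/bin/sh
# Compute the hex affinity mask for pinning one (hypothetical) IRQ to one
# CPU thread, and show where it would be written. IRQ=42/CPU=2 are
# placeholders for illustration.
IRQ=42
CPU=2

# /proc/irq/<irq>/smp_affinity takes a hex CPU bitmask;
# /proc/irq/<irq>/smp_affinity_list takes a plain CPU list instead.
MASK=$(printf '%x' $((1 << CPU)))
echo "IRQ $IRQ -> CPU $CPU (mask 0x$MASK)"

# As root, on a real system, the pin itself would be:
#   echo "$MASK" > "/proc/irq/$IRQ/smp_affinity"
# or, equivalently:
#   echo "$CPU"  > "/proc/irq/$IRQ/smp_affinity_list"
```

Note you'd want irqbalance disabled first (`systemctl disable --now irqbalance`), or it will periodically rewrite the masks underneath you.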
I'm wondering if this is a byproduct of the new ladder cache design within the Zen 5 CCX?

Regardless: if you have latencies like this within a single socket, you likely stand to gain something by pinning processes to NUMA nodes even with 1P servers. For comparison, the results mentioned in my presentation are all based on 1P platforms as well.

> > So what is most optimal there? Does it still make sense to have the Ceph processes bound to the CPU where their respective NVMe resides when the network interface card is attached to another CPU / NUMA node? Or would this just result in more inter-NUMA traffic (latency) and negate any possible gains that could have been made?

I never benchmarked this, so I can only guess. However: if you look at /proc/interrupts, you will see that most if not all enterprise NVMes in Linux effectively get allocated an MSI vector per CPU thread per NVMe. Moreover, if you look at /proc/irq/<irq>/smp_affinity for each of those MSI vectors, you will see that they are each pinned to exactly one CPU thread. In my experience, when NUMA-pinning OSDs, only the MSI vectors local to the NUMA node where the OSD runs show any real activity. That seems optimal, so I've never had a reason to look any further.

> > So the default policy seems to be active, and no Ceph NUMA affinity seems to have taken place. Can someone explain to me what Ceph (cephadm) is currently doing when the "osd_numa_auto_affinity" config setting is true and NUMA is exposed?

I, personally, am in the camp of folk who are not cephadm fans. What I did in my case was to write a shim that sits in front of the ceph-osd@.service unit, effectively overriding the default

    ExecStart=/usr/bin/ceph-osd...

and replacing it with

    ExecStart=/usr/local/bin/my_numa_shim /usr/bin/ceph-osd...
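For a rough idea of the shape such a shim can take, here is an illustrative sketch, not Tyler's actual tool: the NUMA_DEV variable is an invented convention standing in for the shim's a-priori platform knowledge, and SYSROOT is only a knob for testing against a mock sysfs tree. It probes which NUMA node is local to the OSD's block device via sysfs, then exec's the wrapped command under numactl so the pinning survives into ceph-osd:

```shell
#!/bin/sh
# Sketch of a NUMA shim in the spirit of my_numa_shim (illustrative, not
# the real tool). Intended use from a systemd unit override:
#   ExecStart=/usr/local/bin/my_numa_shim /usr/bin/ceph-osd...

# Print the NUMA node local to a block device via sysfs, or -1 if
# unknown. $1 = device name, e.g. nvme0n1. SYSROOT is a test-only prefix.
numa_node_of() {
    cat "${SYSROOT:-}/sys/block/$1/device/numa_node" 2>/dev/null || echo -1
}

# Exec the wrapped command pinned to the node local to device $NUMA_DEV
# (an assumed convention; a real shim would infer the OSD's own device).
numa_exec() {
    node=$(numa_node_of "${NUMA_DEV:-nvme0n1}")
    if [ "$node" -ge 0 ] && command -v numactl >/dev/null; then
        # Bind both CPU scheduling and memory allocation to the local
        # node, then exec so systemd still tracks ceph-osd directly.
        exec numactl --cpunodebind="$node" --membind="$node" "$@"
    fi
    # No NUMA info (or no numactl): run unpinned rather than fail.
    exec "$@"
}

# When installed as the shim, run the wrapped command immediately.
if [ $# -gt 0 ]; then numa_exec "$@"; fi
```

The exec (rather than fork) is the important part: systemd's ExecStart= PID ends up being ceph-osd itself, and the affinity set by numactl carries across the execve.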
my_numa_shim is a tool which has some a priori knowledge of how the platforms are configured, and makes a decision about which NUMA node to use for a given OSD after probing which NUMA node is most local to the storage device associated with the OSD. It then sets the affinity/memory allocation mode of the process and does an execve() to call /usr/bin/ceph-osd as systemd had originally intended. The pinning is not changed by the execve().

Would something similar work with cephadm? Probably, but offhand I have no idea how to implement it.

Cheers,
Tyler
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx