Dear list,
Recently we experienced a short outage of our Ceph storage. The cause was
surprising and probably indicates a subtle misconfiguration on our part,
so I'm hoping for a useful suggestion ;-)
We are running a 3 PB cluster with 21 OSD nodes (spread across 3
datacenters), 3 mon/mgr nodes and 2 MDS nodes. Currently we are on Octopus
15.2.16 (will upgrade to .17 soon).
Each node has a single 25 Gbit/s network interface (on most nodes a bond).
The physical nodes are all Dell AMD EPYC hardware.
The "cluster network" and "public network" configurations in
/etc/ceph/ceph.conf were all set to 0.0.0.0/0 since we only have a
single interface for all Ceph nodes (or so we thought...)
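Concretely, the relevant part of our ceph.conf looked roughly like this
(paraphrased, not a verbatim copy of our file):

    [global]
        public network  = 0.0.0.0/0
        cluster network = 0.0.0.0/0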
Our nodes are managed using cfengine3 (community). We avoid package
upgrades during normal operation, but new packages are installed when
cfengine is told to do so.
Last Sunday at around 23:05 (local time) we experienced a short network
glitch (an MLAG link lost one sublink for 4 seconds). Our logs show that
it should have been relatively painless, since the peer-link took over
and after 4 seconds the MLAG went back to FULL mode. However, it seems a
lot of ceph-osd services restarted or re-connected to the network, failed
to find the other OSDs, and consequently shut themselves down. Shortly
after that, the Ceph services became unavailable because not enough OSDs
were up, so our services depending on Ceph became unavailable as well.
At this point I was able to start trying to fix it: I rebooted one OSD
machine and, on other nodes, just restarted the OSD services. Both
approaches seemed to work, and I could soon turn in when all was well again.
When trying to understand what had happened, we obviously suspected all
kinds of unrelated things (the Ceph logs are way too noisy to quickly
get to the point), but after some googling, one message turned out to be
more important than we first thought: "osd.54 662927 set_numa_affinity
unable to identify public interface '' numa node: (2) No such file or
directory"
(https://forum.proxmox.com/threads/ceph-set_numa_affinity-unable-to-identify-public-interface.58239/)
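If it helps anyone debugging something similar: as far as I understand,
the OSD tries to map its public address to a local interface in order to
pick a NUMA node, and the empty interface name '' means it couldn't find
a match. Something like the following (ceph osd numa-status exists since
Nautilus, I believe) shows what the OSDs resolved, and which addresses
actually exist on the node:

    # what Ceph reports about per-OSD NUMA/interface detection
    ceph osd numa-status

    # on the OSD node itself: which interfaces and addresses are present
    ip -br addr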
We couldn't understand why the network glitch could cause such a massive
die-off of ceph-osd services.
Assuming that sooner or later we were going to need some help with this,
it seemed like a good idea to first update the nodes to the latest release
(and later to a still supported release) of Ceph, so we started the
upgrade to 15.2.17 today.
The upgrade of the 2 virtual mons and the 1 physical mon went OK, and the
first OSD node was fine as well. But on the second OSD node, the OSD
services would not keep running after the upgrade+reboot.
Again we noticed this NUMA message, but now 6 times in a row, followed by
the nice: "_committed_osd_maps marked down 6 > osd_max_markdown_count 5
in last 600.000000 seconds, shutting down"
and
"received signal: Interrupt from Kernel"
At this point, one of us noticed that a strange IP address was mentioned:
169.254.0.2. It turns out that a recently added package (OpenManage) and
some configuration had added this interface and link-local address to the
Dell hardware nodes. For us, the single-interface assumption is now out
the window, and 0.0.0.0/0 is a bad idea in /etc/ceph/ceph.conf for the
public and cluster network (even though it's the same network for us).
Our 3 datacenters are on three different subnets, so it becomes a bit
difficult to make the setting more specific. The nodes are all under the
same /16, so we could choose that, but it is starting to look like a
weird network setup.
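If I understand the documentation correctly, both "public network" and
"cluster network" accept a comma-separated list of subnets, so either of
the following should work (the addresses below are placeholders, not our
real ranges):

    [global]
        # one subnet per datacenter instead of 0.0.0.0/0
        public network  = 10.1.0.0/24, 10.2.0.0/24, 10.3.0.0/24
        cluster network = 10.1.0.0/24, 10.2.0.0/24, 10.3.0.0/24

        # or simply the common /16
        #public network  = 10.0.0.0/16
        #cluster network = 10.0.0.0/16

Either way the OSDs should then ignore the 169.254.0.0/16 link-local
address when picking their interface, as far as I understand it.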
I've always thought that this configuration was kind of non-intuitive
and I still do. And now it has bitten us :-(
Thanks for reading and if you have any suggestions on how to fix/prevent
this kind of error, we'll be glad to hear it!
Cheers
/Simon