Dear list,
Recently we experienced a short outage of our Ceph storage. The cause was
surprising and probably indicates a subtle misconfiguration on our part,
so I'm hoping for a useful suggestion ;-)
We are running a 3 PB cluster with 21 OSD nodes (spread across 3
datacenters), 3 mon/mgr nodes and 2 MDS nodes. Currently we are on Octopus
15.2.16 (will upgrade to .17 soon).
Each node has a single 25 Gbit/s network interface (on most nodes a bond).
The physical nodes are all Dell AMD EPYC hardware.
The "cluster network" and "public network" configurations in
/etc/ceph/ceph.conf were all set to 0.0.0.0/0 since we only have a
single interface for all Ceph nodes (or so we thought...)
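Concretely, the relevant part of our ceph.conf looked roughly like this
(paraphrased, not a verbatim copy of our file):

    [global]
        public network  = 0.0.0.0/0
        cluster network = 0.0.0.0/0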
Our nodes are managed using cfengine3 (community). We avoid package
upgrades during normal operation, but new packages are installed when
cfengine is told to do so.
Last Sunday at around 23:05 (local time) we experienced a short network
glitch (an MLAG link lost one sublink for 4 seconds). Our logs show that
it should have been relatively painless, since the peer-link took over
and after 4 seconds the MLAG went back to FULL mode. However, it seems a
lot of ceph-osd services restarted or re-connected to the network, failed
to find the other OSDs, and consequently shut themselves down. Shortly
after that, the Ceph services became unavailable because not enough OSDs
were up, so our services depending on Ceph became unavailable as well.
At this point I was able to start trying to fix it: I rebooted one OSD
machine and, on other nodes, just restarted the OSD services. Both
approaches seemed to work, and I could soon turn in when all was well again.
When trying to understand what had happened, we obviously suspected all
kinds of unrelated things (the Ceph logs are way too noisy to quickly
get to the point), but after some googling, one message turned out to be
more important than we first thought: "osd.54 662927 set_numa_affinity
unable to identify public interface '' numa node: (2) No such file or
directory"
(https://forum.proxmox.com/threads/ceph-set_numa_affinity-unable-to-identify-public-interface.58239/)
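If it helps anyone debugging something similar: as far as I understand,
the OSD tries to map its public address to a local interface in order to
pick a NUMA node, and the empty interface name '' means it couldn't find
a match. Something like the following (ceph osd numa-status exists since
Nautilus, I believe) shows what the OSDs resolved, and which addresses
actually exist on the node:

    # what Ceph reports about per-OSD NUMA/interface detection
    ceph osd numa-status

    # on the OSD node itself: which interfaces and addresses are present
    ip -br addr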
We couldn't understand why the network glitch could cause such a massive
die-off of ceph-osd services.
Assuming that sooner or later we were going to need some help with this,
it seemed like a good idea to first update the nodes to the latest release
(and later to a still supported release) of Ceph, so we started the
upgrade to 15.2.17 today.
The upgrade of the 2 virtual mons and the 1 physical mon went OK, and the
first OSD node was fine as well. But on the second OSD node, the OSD
services would not keep running after the upgrade+reboot.
Again we noticed this NUMA message, but now 6 times in a row, followed by
the nice: "_committed_osd_maps marked down 6 > osd_max_markdown_count 5
in last 600.000000 seconds, shutting down"
and
"received signal: Interrupt from Kernel"
At this point, one of us noticed that a strange IP address was mentioned:
169.254.0.2. It turns out that a recently added package (OpenManage) and
some configuration had added this interface and link-local address to the
Dell hardware nodes. For us, the single-interface assumption is now out
the window, and 0.0.0.0/0 is a bad idea in /etc/ceph/ceph.conf for the
public and cluster network (even though it's the same network for us).
Our 3 datacenters are on three different subnets, so it becomes a bit
difficult to make the setting more specific. The nodes are all under the
same /16, so we could choose that, but it is starting to look like a
weird network setup.
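If I understand the documentation correctly, both "public network" and
"cluster network" accept a comma-separated list of subnets, so either of
the following should work (the addresses below are placeholders, not our
real ranges):

    [global]
        # one subnet per datacenter instead of 0.0.0.0/0
        public network  = 10.1.0.0/24, 10.2.0.0/24, 10.3.0.0/24
        cluster network = 10.1.0.0/24, 10.2.0.0/24, 10.3.0.0/24

        # or simply the common /16
        #public network  = 10.0.0.0/16
        #cluster network = 10.0.0.0/16

Either way the OSDs should then ignore the 169.254.0.0/16 link-local
address when picking their interface, as far as I understand it.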
I've always thought that this configuration was kind of non-intuitive
and I still do. And now it has bitten us :-(
Thanks for reading and if you have any suggestions on how to fix/prevent
this kind of error, we'll be glad to hear it!
Cheers
/Simon