post-mortem of a ceph disruption

Dear list,

Recently we experienced a short outage of our Ceph storage. The cause was surprising and probably points to a subtle misconfiguration on our part, so I'm hoping for a useful suggestion ;-)

We are running a 3 PB cluster with 21 OSD nodes (spread across 3 datacenters), 3 mon/mgr nodes and 2 MDS nodes. Currently we are on Octopus 15.2.16 (we will upgrade to .17 soon). Each node has a single network interface (most are a bond) at 25 Gbit/s. The physical nodes are all Dell AMD EPYC hardware.

The "cluster network" and "public network" configurations in /etc/ceph/ceph.conf were all set to 0.0.0.0/0 since we only have a single interface for all Ceph nodes (or so we thought...)

Our nodes are managed using cfengine3 (community), and we avoid package upgrades during normal operation; new packages are installed, though, if cfengine is instructed to do so.

Last Sunday at around 23:05 (local time) we experienced a short network glitch (an MLAG link lost one sublink for 4 seconds). Our logs suggest it should have been relatively painless, since the peer-link took over and after 4 seconds the MLAG went back to FULL mode. However, a lot of ceph-osd services apparently restarted or re-connected to the network, failed to find the other OSDs, and consequently shut themselves down. Shortly after that, Ceph became unavailable because not enough OSDs were up, so our services depending on Ceph became unavailable as well.

At this point I could start trying to fix it: I rebooted one OSD machine and, on other nodes, just restarted the OSD services. Both approaches worked, and I could turn in soon after, once all was well again.

When trying to understand what had happened, we obviously suspected all kinds of unrelated things (the Ceph logs are far too noisy to quickly get to the point), but after some googling one message turned out to be more important than we first thought: "osd.54 662927 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory" (see also https://forum.proxmox.com/threads/ceph-set_numa_affinity-unable-to-identify-public-interface.58239/).
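
If anyone else runs into that message: as far as we understand it, it comes from the OSD's automatic NUMA pinning (osd_numa_auto_affinity), which tries to map the public interface to a NUMA node at startup. The commands below are only a sketch of how one could inspect or disable that behaviour (the osd id is just an example):

    # check whether automatic NUMA affinity is enabled on a running OSD
    ceph daemon osd.54 config get osd_numa_auto_affinity

    # disable it cluster-wide via the monitor config store
    ceph config set osd osd_numa_auto_affinity false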

We couldn't understand how the network glitch could cause such a massive die-off of ceph-osd services. On the assumption that sooner or later we were going to need some help with this, it seemed a good idea to first update the nodes to the latest still-supported release of Ceph, so we started the upgrade to 15.2.17 today.

The upgrade of the 2 virtual and 1 physical mon went OK, and the first OSD node was fine as well. But on the second OSD node, the OSD services would not keep running after the upgrade + reboot.

Again we noticed the NUMA message, but now 6 times in a row, and then the nice:

"_committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down"

and

"received  signal: Interrupt from Kernel"

At this point one of us noticed that a strange IP address was mentioned: 169.254.0.2. It turns out that a recently added package (OpenManage) and some configuration had added this interface and address to the Dell hardware nodes. So our single-interface assumption is now out the window, and 0.0.0.0/0 is a bad idea in /etc/ceph/ceph.conf for public and cluster network (even though it is the same network for us).
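
For anyone wanting to check their own nodes, the stray address is easy to spot with ip; the interface name in the comment is only our assumption of how the OpenManage/iDRAC pass-through device typically shows up, so don't take it literally:

    # list all IPv4 addresses and look for anything outside your real Ceph network
    ip -4 addr show
    # on an affected node something like the following appears (illustrative):
    #   idrac: ... inet 169.254.0.2/24 ...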

Our 3 datacenters are on three different subnets, so it is a bit difficult to make the setting more specific. The nodes are all under the same /16, so we could choose that, but then it starts to look like a weird network setup. I've always thought this configuration was kind of non-intuitive, and I still do. And now it has bitten us :-(
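
One option we are considering (if we read the documentation correctly, the public/cluster network settings accept a comma-separated list of subnets) is to list the three datacenter ranges explicitly instead of the whole /16. The subnets below are made-up placeholders:

    [global]
        public network  = 192.0.2.0/24, 198.51.100.0/24, 203.0.113.0/24
        cluster network = 192.0.2.0/24, 198.51.100.0/24, 203.0.113.0/24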


Thanks for reading, and if you have any suggestions on how to fix or prevent this kind of error, we'll be glad to hear them!

Cheers

/Simon


