Hi,

Due to a faulty upgrade from Jewel 10.2.0 to Kraken 11.2.0, our test cluster has been unhealthy for about two weeks and can no longer recover on its own (unfortunately I skipped the intermediate upgrade to 10.2.5 because I missed the ".z" in "All clusters must first be upgraded to Jewel 10.2.z").

Immediately after the upgrade I saw the following in the OSD logs:

  s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed

There are also missed heartbeats in the OSD logs, and the OSDs which fail to send heartbeats have the following in their logs:

  2017-02-08 19:44:51.367828 7f9be8c37700 1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
  2017-02-08 19:44:54.271010 7f9bc4e96700 1 heartbeat_map reset_timeout 'tp_osd thread tp_osd' had timed out after 15

While investigating we found that some OSDs were lagging between roughly 100 and 20,000 OSD map epochs behind. The monitors publish new epochs every few seconds, but the OSD daemons are very slow at applying them (up to a few minutes for 100 epochs). While the 24 OSDs of a storage node are catching up, the CPUs run at almost 100% (the nodes have 16 physical cores, or 32 with Hyper-Threading). At times we had servers where all 24 OSDs were up to date with the latest OSD map, but somehow they fell behind again. During recovery some OSDs used up to 25 GB of RAM, which led to out-of-memory conditions and made the OSDs on the affected server lag even further.

We have already set the nodown, noout, norebalance, nobackfill, norecover, noscrub and nodeep-scrub flags to prevent OSD flapping and the creation of even more OSD map epochs.

Is there anything we can do to let the OSDs recover? It seems the servers simply don't have enough CPU resources for this kind of recovery. I have already played around with the osd map message max setting (when I increased it to 1000 to speed up catching up, the OSDs didn't receive any map updates at all?), and with the osd heartbeat grace and osd thread timeout settings (to give the overloaded servers more time), but without success so far. I have also seen errors related to the AsyncMessenger in the logs, so I reverted to the SimpleMessenger (which had been working fine with Jewel).

Cluster details:

- 6 storage nodes, each with 2x Intel Xeon E5-2630 v3 (8 cores @ 2.40 GHz) and 256 GB RAM
- 24 HDDs per storage node, one OSD per disk, journal on the same disk
- 3 monitors in total, co-located with the storage nodes
- separate front and back networks (10 Gbit)
- OS: CentOS 7.2.1511
- Kernel: 4.9.8-1.el7.elrepo.x86_64 from elrepo.org

Thanks,
Andreas
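
P.S. For completeness, the flags mentioned above were set the usual way, i.e.:

  ceph osd set nodown
  ceph osd set noout
  ceph osd set norebalance
  ceph osd set nobackfill
  ceph osd set norecover
  ceph osd set noscrub
  ceph osd set nodeep-scrub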
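
The config experiments were done in ceph.conf, roughly along the lines of the sketch below. Only osd map message max = 1000 is the value I actually tried; the grace/timeout numbers are just placeholders to show the kind of change, and I'm writing "osd op thread timeout" here as a stand-in for the thread timeout behind the heartbeat_map messages:

  [global]
  # reverted from the AsyncMessenger back to the SimpleMessenger
  ms type = simple

  [osd]
  # maximum number of OSD map epochs the monitor ships per map message
  osd map message max = 1000
  # placeholder values, only raised to give the overloaded nodes more time
  osd heartbeat grace = 60
  osd op thread timeout = 60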