Dear All,

I recently upgraded our Ceph cluster from 12.2.4 to 12.2.5 and simultaneously upgraded the OS from Scientific Linux 7.4 to 7.5.

After the reboot, 0.7% of objects were misplaced and many pgs were degraded. The cluster had no client connections, so I sped up recovery with:

ceph tell 'osd.*' injectargs '--osd-max-backfills 16'

The cluster then rebalanced at >6000 MB/s, but the number of misplaced objects started shooting up...

In case something very nasty was going on, I set osd nodown and rebooted the cluster.

21st May, post-reboot health status:

    pgs: 10003755/5184299696 objects degraded (0.193%)
         282514666/5184299696 objects misplaced (5.449%)
    recovery: 1901 MB/s, 657 objects/s

The cluster continued to mend, slowly this time (default osd-max-backfills).

28th May:

    nodown flag(s) set
    24820486/5352446983 objects misplaced (0.464%)
    Degraded data redundancy: 816609/5352446983 objects degraded (0.015%), 179 pgs degraded, 6 pgs undersized

30th May:

    nodown flag(s) set
    3571105/5667392354 objects misplaced (0.063%)
    Degraded data redundancy: 40/5667392354 objects degraded (0.000%), 1 pg degraded

All good, so I thought, but this morning (31st May):

    nodown flag(s) set
    41264874/5190843723 objects misplaced (0.795%)
    Degraded data redundancy: 11795/5190843723 objects degraded (0.000%), 226 pgs degraded

Of course I'm perplexed as to what might have caused this...

Looking at /var/log/ceph.log-20180531.gz, there is a sudden jump in objects misplaced at 22:55:28:

2018-05-30 22:55:18.154529 mon.ceph2 mon.0 10.1.0.80:6789/0 71418 : cluster [WRN] Health check update: 2666818/5085379079 objects misplaced (0.052%) (OBJECT_MISPLACED)
2018-05-30 22:55:20.096386 mon.ceph2 mon.0 10.1.0.80:6789/0 72319 : cluster [WRN] Health check failed: Reduced data availability: 34 pgs peering (PG_AVAILABILITY)
2018-05-30 22:55:22.197206 mon.ceph2 mon.0 10.1.0.80:6789/0 72333 : cluster [WRN] Health check failed: Degraded data redundancy: 1123/5079163159 objects degraded (0.000%), 21 pgs degraded (PG_DEGRADED)
2018-05-30 22:55:23.155873 mon.ceph2 mon.0 10.1.0.80:6789/0 72335 : cluster [WRN] Health check update: 2666363/5079163159 objects misplaced (0.052%) (OBJECT_MISPLACED)
2018-05-30 22:55:25.450185 mon.ceph2 mon.0 10.1.0.80:6789/0 72336 : cluster [WRN] Health check update: Reduced data availability: 2 pgs inactive, 38 pgs peering (PG_AVAILABILITY)
2018-05-30 22:55:27.521142 mon.ceph2 mon.0 10.1.0.80:6789/0 72337 : cluster [WRN] Health check update: Degraded data redundancy: 13808/5085377819 objects degraded (0.000%), 270 pgs degraded (PG_DEGRADED)
2018-05-30 22:55:27.521181 mon.ceph2 mon.0 10.1.0.80:6789/0 72338 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 2 pgs inactive, 38 pgs peering)
2018-05-30 22:55:28.157397 mon.ceph2 mon.0 10.1.0.80:6789/0 72339 : cluster [WRN] Health check update: 54749389/5085377819 objects misplaced (1.077%) (OBJECT_MISPLACED)
2018-05-30 22:55:33.158644 mon.ceph2 mon.0 10.1.0.80:6789/0 72340 : cluster [WRN] Health check update: 54748082/5085377079 objects misplaced (1.077%) (OBJECT_MISPLACED)
2018-05-30 22:55:33.158698 mon.ceph2 mon.0 10.1.0.80:6789/0 72341 : cluster [WRN] Health check update: Degraded data redundancy: 13600/5085377079 objects degraded (0.000%), 265 pgs degraded (PG_DEGRADED)
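In case anyone wants to see the full trend rather than just the excerpt above, something along these lines should pull the misplaced-object counts out of the cluster log (a rough sketch; the awk field offsets assume the "Health check update/failed ... (OBJECT_MISPLACED)" line format shown above):

zgrep -F '(OBJECT_MISPLACED)' /var/log/ceph.log-20180531.gz \
    | awk '{print $1, $2, $(NF-4), $(NF-1)}'

i.e. date, time, misplaced/total objects and the percentage, one line per health check.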
ceph-mgr.ceph1.log-20180531.gz has 455 identical entries logged in two seconds (not seen elsewhere in the mgr log):

2018-05-30 22:55:18.839956 7f9cd8590700 1 mgr[restful] Unknown request ''

ceph-mon.ceph1.log-20180531.gz entries at this time are as follows:

2018-05-30 22:55:17.487698 7f3fef608700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1527717317487691, "job": 6580, "event": "table_file_deletion", "file_number": 801008}
(snip - 12 similar lines)
2018-05-30 22:55:17.637385 7f3fef608700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1527717317637379, "job": 6580, "event": "table_file_deletion", "file_number": 800994}
2018-05-30 22:55:18.817637 7f3ff5614700 1 mon.ceph1@1(peon).osd e161984 e161984: 454 total, 453 up, 453 in
2018-05-30 22:55:19.121033 7f3ff7618700 1 'Monitor::cpu_tp thread 0x7f3ff7618700' had timed out after 0
2018-05-30 22:55:19.158228 7f3ff5e15700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7f3ff5e15700' had timed out after 0
2018-05-30 22:55:20.168560 7f3ff5614700 1 mon.ceph1@1(peon).osd e161985 e161985: 454 total, 453 up, 453 in

Other info:

# ceph balancer status
{
    "active": true,
    "plans": [],
    "mode": "crush-compat"
}

(more on the balancer in the P.S. below)

There is also one dead OSD, down and empty, waiting to be replaced - presumably this is not complicating matters?

# ceph osd df | grep 423
423 hdd 7.27730 0 0 0 0 0 0 0

Many thanks for any advice,

Jake
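P.S. Since the balancer shows as active in crush-compat mode above, one obvious experiment would be to pause it for a night and see whether the misplaced count stays flat. As far as I understand the Luminous mgr balancer module, that is just (untested here, so treat it as a sketch rather than a result):

# ceph balancer off
# ceph balancer status    (should then report "active": false)

with "ceph balancer on" to re-enable it afterwards.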