Hello,

I have a Ceph cluster (10.2.1) with 10 nodes, 3 mons and 290 OSDs, plus an RGW instance with bucket data in an EC 6+3 pool. I've recently started testing the cluster's redundancy by powering nodes off one by one. Suddenly all monitors went crazy, eating 100% CPU; in "perf top" roughly 80% of the samples are in ceph-mon [.] crush_hash32_3. "ceph -s" still responds, but slowly, and the monmap election epoch is increasing constantly, by about 100 every 5 minutes, so nothing really works as there is effectively no quorum. This already happened to me once before, when I powered off 4 nodes out of 10; the only thing that helped then was removing all mons except one from the monmap (roughly the procedure sketched below).

Here is the current "ceph -s" output:

    cluster 5ddb8aab-49b4-4a63-918e-33c569e3101e
     health HEALTH_WARN
            35 pgs backfill_wait
            26126 pgs degraded
            4 pgs recovering
            2928 pgs recovery_wait
            26130 pgs stuck unclean
            26125 pgs undersized
            recovery 127536/334221 objects degraded (38.159%)
            recovery 139603/334221 objects misplaced (41.770%)
            too many PGs per OSD (1325 > max 1000)
     monmap e6: 3 mons at {ed-ds-c171=10.144.66.171:6789/0,ed-ds-c172=10.144.66.172:6789/0,ed-ds-c173=10.144.66.173:6789/0}
            election epoch 1284, quorum 0,1,2 ed-ds-c171,ed-ds-c172,ed-ds-c173
     osdmap e3950: 290 osds: 174 up, 174 in; 19439 remapped pgs
            flags sortbitwise
      pgmap v241407: 26760 pgs, 16 pools, 143 GB data, 37225 objects
            258 GB used, 949 TB / 949 TB avail
            127536/334221 objects degraded (38.159%)
            139603/334221 objects misplaced (41.770%)
               11972 active+undersized+degraded
               11187 active+undersized+degraded+remapped
                2612 active+recovery_wait+undersized+degraded+remapped
                 630 active+clean
                 315 active+recovery_wait+undersized+degraded
                  35 active+undersized+degraded+remapped+wait_backfill
                   3 active+remapped
                   3 active+recovering+undersized+degraded+remapped
                   1 active
                   1 active+recovery_wait+degraded
                   1 active+recovering+undersized+degraded
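For reference, this is roughly how I'm observing it while "ceph -s" is barely responding (assuming the default admin socket path under /var/run/ceph; adjust for your setup):

    # CPU profile of the mon process -- this is where crush_hash32_3 dominates
    perf top -p $(pidof ceph-mon)

    # ask a mon for its election state directly, bypassing the flapping quorum
    ceph --admin-daemon /var/run/ceph/ceph-mon.ed-ds-c171.asok mon_status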
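And this is, from memory, roughly the procedure I used last time to shrink to a single monitor (mon IDs as in the monmap above; /tmp/monmap is just a scratch path):

    # on the surviving mon host, with every ceph-mon daemon stopped
    ceph-mon -i ed-ds-c171 --extract-monmap /tmp/monmap
    monmaptool --print /tmp/monmap
    monmaptool --rm ed-ds-c172 /tmp/monmap
    monmaptool --rm ed-ds-c173 /tmp/monmap
    ceph-mon -i ed-ds-c171 --inject-monmap /tmp/monmap
    # then start only mon.ed-ds-c171 again

After that the single mon had quorum with itself and the cluster became manageable again, which is why I say it was the only thing that helped.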
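The debug logging below was turned up through the admin socket as well (again, socket path assumed), something like:

    ceph --admin-daemon /var/run/ceph/ceph-mon.ed-ds-c171.asok config set debug_mon 20/20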
Logs from the quorum leader with debug_mon=20:

2016-05-16 17:34:41.318132 7f5079bb2700 5 mon.ed-ds-c171@0(leader).elector(1310) handle_propose from mon.2
2016-05-16 17:34:41.318134 7f5079bb2700 10 mon.ed-ds-c171@0(leader).elector(1310) handle_propose required features 9025616074506240, peer features 576460752032874495
2016-05-16 17:34:41.318136 7f5079bb2700 10 mon.ed-ds-c171@0(leader).elector(1310) bump_epoch 1310 to 1311
2016-05-16 17:34:41.318345 7f5079bb2700 10 mon.ed-ds-c171@0(leader) e6 join_election
2016-05-16 17:34:41.318352 7f5079bb2700 10 mon.ed-ds-c171@0(leader) e6 _reset
2016-05-16 17:34:41.318353 7f5079bb2700 10 mon.ed-ds-c171@0(leader) e6 cancel_probe_timeout (none scheduled)
2016-05-16 17:34:41.318355 7f5079bb2700 10 mon.ed-ds-c171@0(leader) e6 timecheck_finish
2016-05-16 17:34:41.318358 7f5079bb2700 15 mon.ed-ds-c171@0(leader) e6 health_tick_stop
2016-05-16 17:34:41.318359 7f5079bb2700 15 mon.ed-ds-c171@0(leader) e6 health_interval_stop
2016-05-16 17:34:41.318361 7f5079bb2700 10 mon.ed-ds-c171@0(leader) e6 scrub_event_cancel
2016-05-16 17:34:41.318363 7f5079bb2700 10 mon.ed-ds-c171@0(leader) e6 scrub_reset
2016-05-16 17:34:41.318368 7f5079bb2700 10 mon.ed-ds-c171@0(electing) e6 start_election
2016-05-16 17:34:41.318371 7f5079bb2700 10 mon.ed-ds-c171@0(electing) e6 _reset
2016-05-16 17:34:41.318372 7f5079bb2700 10 mon.ed-ds-c171@0(electing) e6 cancel_probe_timeout (none scheduled)
2016-05-16 17:34:41.318372 7f5079bb2700 10 mon.ed-ds-c171@0(electing) e6 timecheck_finish
2016-05-16 17:34:41.318373 7f5079bb2700 15 mon.ed-ds-c171@0(electing) e6 health_tick_stop
2016-05-16 17:34:41.318374 7f5079bb2700 15 mon.ed-ds-c171@0(electing) e6 health_interval_stop
2016-05-16 17:34:41.318375 7f5079bb2700 10 mon.ed-ds-c171@0(electing) e6 scrub_event_cancel
2016-05-16 17:34:41.318376 7f5079bb2700 10 mon.ed-ds-c171@0(electing) e6 scrub_reset
2016-05-16 17:34:41.318377 7f5079bb2700 10 mon.ed-ds-c171@0(electing) e6 cancel_probe_timeout (none scheduled)
2016-05-16 17:34:41.318380 7f5079bb2700 0 log_channel(cluster) log [INF] : mon.ed-ds-c171 calling new monitor election
2016-05-16 17:34:41.318403 7f5079bb2700 5 mon.ed-ds-c171@0(electing).elector(1311) start -- can i be leader?
2016-05-16 17:34:41.318448 7f5079bb2700 1 mon.ed-ds-c171@0(electing).elector(1311) init, last seen epoch 1311
2016-05-16 17:34:41.318677 7f5079bb2700 20 mon.ed-ds-c171@0(electing) e6 _ms_dispatch existing session 0x55b390116a80 for mon.1 10.144.66.172:6789/0
2016-05-16 17:34:41.318681 7f5079bb2700 20 mon.ed-ds-c171@0(electing) e6 caps allow *
2016-05-16 17:34:41.318686 7f5079bb2700 20 is_capable service=mon command= read on cap allow *
2016-05-16 17:34:41.318688 7f5079bb2700 20 allow so far , doing grant allow *
2016-05-16 17:34:41.318689 7f5079bb2700 20 allow all
2016-05-16 17:34:41.318690 7f5079bb2700 10 mon.ed-ds-c171@0(electing) e6 received forwarded message from mon.1 10.144.66.172:6789/0 via mon.1 10.144.66.172:6789/0

Any help is appreciated!

Best regards,
Vasily.