Hi Dave,

Nothing sticks out to me as being the cause of the problem. If you restart one of the OSDs, is there anything obvious in the logs? Apart from that, I'm out of ideas, I'm afraid.

Nick

From: Dave Durkee [mailto:dave@xxxxxxx]

Nick,

I rebuilt the cluster using the following commands:

ceph-deploy purge admin mon osd1 osd2 osd3
ceph-deploy purgedata admin mon osd1 osd2 osd3
ceph-deploy forgetkeys
rm -f ceph.bootstrap-rgw.keyring ceph.log ceph.conf
ceph-deploy new mon
cat ceph.conf.add >> ceph.conf
ceph-deploy install admin mon osd1 osd2 osd3
ceph-deploy mon create-initial
ceph-deploy disk zap osd1:sdc osd1:sdd osd1:sde
ceph-deploy osd create osd1:sdc:/journal/c osd1:sdd:/journal/d osd1:sde:/journal/e
ceph-deploy admin admin mon osd1 osd2 osd3
chmod +r /etc/ceph/ceph.client.admin.keyring
ceph health

I received no errors during the above process. Here is a copy of the ceph.conf:

[global]
auth_service_required = cephx
filestore_xattr_use_omap = true
auth_client_required = cephx
auth_cluster_required = cephx
mon_host = 172.17.1.16
mon_initial_members = mon
fsid = f070bdc0-ccff-4d1d-bb3e-071d695ed629
osd pool default size = 2
public network = 172.17.1.0/24
cluster network = 10.0.0.0/24

ceph health detail produces the following output:

HEALTH_WARN 24 pgs degraded; 24 pgs stuck degraded; 64 pgs stuck unclean; 24 pgs stuck undersized; 24 pgs undersized; too few PGs per OSD (21 < min 30)
pg 0.22 is stuck unclean since forever, current state active+remapped, last acting [1,0] pg 0.21 is stuck unclean since forever, current state active+remapped, last acting [2,0] pg 0.20 is stuck unclean since forever, current state active+remapped, last acting [1,0] pg 0.1f is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.1e is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.1d is stuck unclean since forever, current state active+remapped, last acting [2,0] pg 0.1c is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.1b is stuck unclean since forever, current state active+remapped, last acting [1,0] pg 0.1a is stuck unclean since forever, current state active+remapped, last acting [2,0] pg 0.19 is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.18 is stuck unclean since forever, current state active+remapped, last acting [1,0] pg 0.17 is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.16 is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.15 is stuck unclean since forever, current state active, last acting [2,1] pg 0.14 is stuck unclean since forever, current state active+remapped, last acting [1,0] pg 0.13 is stuck unclean since forever, current state active+remapped, last acting [1,0] pg 0.12 is stuck unclean since forever, current state active+remapped, last acting [1,0] pg 0.11 is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.10 is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.f is stuck unclean since forever, current state active, last acting [2,1] pg 0.e is stuck unclean since forever, current state active, last acting [2,1] pg 0.d is stuck unclean since forever, current state active+remapped, last acting [1,0] pg 0.c is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.b is stuck unclean since forever, current
state active, last acting [2,1] pg 0.a is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.9 is stuck unclean since forever, current state active+remapped, last acting [1,0] pg 0.8 is stuck unclean since forever, current state active+remapped, last acting [1,0] pg 0.7 is stuck unclean since forever, current state active+remapped, last acting [1,0] pg 0.6 is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.5 is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.4 is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.3 is stuck unclean since forever, current state active+remapped, last acting [2,0] pg 0.2 is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.1 is stuck unclean since forever, current state active, last acting [2,1] pg 0.0 is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.3f is stuck unclean since forever, current state active+remapped, last acting [2,0] pg 0.3e is stuck unclean since forever, current state active+remapped, last acting [2,0] pg 0.3d is stuck unclean since forever, current state active+remapped, last acting [1,0] pg 0.3c is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.3b is stuck unclean since forever, current state active, last acting [2,1] pg 0.3a is stuck unclean since forever, current state active, last acting [2,1] pg 0.39 is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.38 is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.37 is stuck unclean since forever, current state active+remapped, last acting [1,0] pg 0.36 is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.35 is stuck unclean since forever, current state active, last acting [2,1] pg 0.34 is stuck unclean since forever, current state active+remapped, last acting [1,0] pg 0.33 is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.32 is stuck unclean since forever, current state active, last acting [2,1] pg 0.31 is stuck unclean since forever, current state active+remapped, last acting [1,0] pg 0.30 is stuck unclean since forever, current state active, last acting [2,1] pg 0.2f is stuck unclean since forever, current state active, last acting [2,1] pg 0.2e is stuck unclean since forever, current state active, last acting [2,1] pg 0.2d is stuck unclean since forever, current state active+remapped, last acting [2,0] pg 0.2c is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.2b is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.2a is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.29 is stuck unclean since forever, current state active+undersized+degraded, last acting [0] pg 0.28 is stuck unclean since forever, current state active, last acting [2,1] pg 0.27 is stuck unclean since forever, current state active+remapped, last acting [1,0] pg 0.26 is stuck unclean since forever, current state active+remapped, last acting [1,0] pg 0.25 is stuck unclean since forever, current state active, last acting [2,1] pg 0.24 is stuck unclean since forever, current state active, last acting [2,1] pg 0.23 is stuck unclean 
since forever, current state active+remapped, last acting [2,0] pg 0.1f is stuck undersized for 1650.474153, current state active+undersized+degraded, last acting [0] pg 0.1e is stuck undersized for 1650.488904, current state active+undersized+degraded, last acting [0] pg 0.1c is stuck undersized for 1650.489953, current state active+undersized+degraded, last acting [0] pg 0.19 is stuck undersized for 1650.491760, current state active+undersized+degraded, last acting [0] pg 0.17 is stuck undersized for 1650.492908, current state active+undersized+degraded, last acting [0] pg 0.16 is stuck undersized for 1650.493515, current state active+undersized+degraded, last acting [0] pg 0.11 is stuck undersized for 1650.496410, current state active+undersized+degraded, last acting [0] pg 0.10 is stuck undersized for 1650.497174, current state active+undersized+degraded, last acting [0] pg 0.c is stuck undersized for 1650.499547, current state active+undersized+degraded, last acting [0] pg 0.a is stuck undersized for 1650.500749, current state active+undersized+degraded, last acting [0] pg 0.6 is stuck undersized for 1650.503065, current state active+undersized+degraded, last acting [0] pg 0.5 is stuck undersized for 1650.503638, current state active+undersized+degraded, last acting [0] pg 0.4 is stuck undersized for 1650.504332, current state active+undersized+degraded, last acting [0] pg 0.2 is stuck undersized for 1650.517230, current state active+undersized+degraded, last acting [0] pg 0.0 is stuck undersized for 1650.518257, current state active+undersized+degraded, last acting [0] pg 0.3c is stuck undersized for 1649.521029, current state active+undersized+degraded, last acting [0] pg 0.39 is stuck undersized for 1649.704285, current state active+undersized+degraded, last acting [0] pg 0.38 is stuck undersized for 1649.896024, current state active+undersized+degraded, last acting [0] pg 0.36 is stuck undersized for 1650.037711, current state active+undersized+degraded, last acting [0] pg 0.33 is stuck undersized for 1650.212724, current state active+undersized+degraded, last acting [0] pg 0.2c is stuck undersized for 1650.466254, current state active+undersized+degraded, last acting [0] pg 0.2b is stuck undersized for 1650.467129, current state active+undersized+degraded, last acting [0] pg 0.2a is stuck undersized for 1650.467775, current state active+undersized+degraded, last acting [0] pg 0.29 is stuck undersized for 1650.468388, current state active+undersized+degraded, last acting [0] pg 0.1f is stuck degraded for 1650.474301, current state active+undersized+degraded, last acting [0] pg 0.1e is stuck degraded for 1650.489051, current state active+undersized+degraded, last acting [0] pg 0.1c is stuck degraded for 1650.490100, current state active+undersized+degraded, last acting [0] pg 0.19 is stuck degraded for 1650.491908, current state active+undersized+degraded, last acting [0] pg 0.17 is stuck degraded for 1650.493055, current state active+undersized+degraded, last acting [0] pg 0.16 is stuck degraded for 1650.493662, current state active+undersized+degraded, last acting [0] pg 0.11 is stuck degraded for 1650.496557, current state active+undersized+degraded, last acting [0] pg 0.10 is stuck degraded for 1650.497321, current state active+undersized+degraded, last acting [0] pg 0.c is stuck degraded for 1650.499694, current state active+undersized+degraded, last acting [0] pg 0.a is stuck degraded for 1650.500897, current state active+undersized+degraded, last acting [0] pg 0.6 is stuck 
degraded for 1650.503213, current state active+undersized+degraded, last acting [0] pg 0.5 is stuck degraded for 1650.503786, current state active+undersized+degraded, last acting [0] pg 0.4 is stuck degraded for 1650.504480, current state active+undersized+degraded, last acting [0] pg 0.2 is stuck degraded for 1650.517378, current state active+undersized+degraded, last acting [0] pg 0.0 is stuck degraded for 1650.518404, current state active+undersized+degraded, last acting [0] pg 0.3c is stuck degraded for 1649.521177, current state active+undersized+degraded, last acting [0] pg 0.39 is stuck degraded for 1649.704432, current state active+undersized+degraded, last acting [0] pg 0.38 is stuck degraded for 1649.896170, current state active+undersized+degraded, last acting [0] pg 0.36 is stuck degraded for 1650.037859, current state active+undersized+degraded, last acting [0] pg 0.33 is stuck degraded for 1650.212872, current state active+undersized+degraded, last acting [0] pg 0.2c is stuck degraded for 1650.466402, current state active+undersized+degraded, last acting [0] pg 0.2b is stuck degraded for 1650.467276, current state active+undersized+degraded, last acting [0] pg 0.2a is stuck degraded for 1650.467922, current state active+undersized+degraded, last acting [0] pg 0.29 is stuck degraded for 1650.468535, current state active+undersized+degraded, last acting [0] pg 0.1f is active+undersized+degraded, acting [0] pg 0.1e is active+undersized+degraded, acting [0] pg 0.1c is active+undersized+degraded, acting [0] pg 0.19 is active+undersized+degraded, acting [0] pg 0.17 is active+undersized+degraded, acting [0] pg 0.16 is active+undersized+degraded, acting [0] pg 0.11 is active+undersized+degraded, acting [0] pg 0.10 is active+undersized+degraded, acting [0] pg 0.c is active+undersized+degraded, acting [0] pg 0.a is active+undersized+degraded, acting [0] pg 0.6 is active+undersized+degraded, acting [0] pg 0.5 is active+undersized+degraded, acting [0] pg 0.4 is active+undersized+degraded, acting [0] pg 0.2 is active+undersized+degraded, acting [0] pg 0.0 is active+undersized+degraded, acting [0] pg 0.3c is active+undersized+degraded, acting [0] pg 0.39 is active+undersized+degraded, acting [0] pg 0.38 is active+undersized+degraded, acting [0] pg 0.36 is active+undersized+degraded, acting [0] pg 0.33 is active+undersized+degraded, acting [0] pg 0.2c is active+undersized+degraded, acting [0] pg 0.2b is active+undersized+degraded, acting [0] pg 0.2a is active+undersized+degraded, acting [0] pg 0.29 is active+undersized+degraded, acting [0] too few PGs per OSD (21 < min 30)

Each OSD is a single 500 GB disk, with a filesystem journal on another disk. I verified the network on all of the hosts and all is well. I am not using jumbo frames yet, as I want to get everything working with stock gigabit networking first. The mon and the OSD hosts each have 2 NICs separated by VLAN tagging. I have configured a public network and a cluster network: the public network is 172.17.1.0/24 and the cluster network is 10.0.0.0/24. The hosts table on each node only has entries for the public network names and IPs.
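(For reference, a quick way to sanity-check both networks from each node; the peer addresses below are only examples on the subnets above, and the -M do / -s flags only become relevant once jumbo frames are enabled:)

# plain reachability on the public and cluster subnets (example peer addresses)
ping -c 3 172.17.1.17
ping -c 3 10.0.0.17
# with jumbo frames enabled end to end, a 9000-byte MTU path should pass this
# (8972 = 9000 minus 20 bytes of IP header and 8 bytes of ICMP header)
ping -c 3 -M do -s 8972 10.0.0.17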
Here is a listing of ceph pg dump dumped all in format plain version 25 stamp 2015-06-26 09:20:31.526802 last_osdmap_epoch 15 last_pg_scan 1 full_ratio 0.95 nearfull_ratio 0.85 pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported upup_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 0.22 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:17.724003 0'0 10:8 [1] 1 [1,0] 1 0'0 2015-06-26 09:17:31.215311 0'0 2015-06-26 09:17:31.215311 0.21 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:27.372001 0'0 15:8 [2] 2 [2,0] 2 0'0 2015-06-26 09:17:31.215308 0'0 2015-06-26 09:17:31.215308 0.20 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:17.720347 0'0 10:8 [1] 1 [1,0] 1 0'0 2015-06-26 09:17:31.215305 0'0 2015-06-26 09:17:31.215305 0.1f 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.674405 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215303 0'0 2015-06-26 09:17:31.215303 0.1e 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.674307 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215300 0'0 2015-06-26 09:17:31.215300 0.1d 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:27.371586 0'0 15:8 [2] 2 [2,0] 2 0'0 2015-06-26 09:17:31.215297 0'0 2015-06-26 09:17:31.215297 0.1c 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.674360 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215294 0'0 2015-06-26 09:17:31.215294 0.1b 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:17.719656 0'0 10:8 [1] 1 [1,0] 1 0'0 2015-06-26 09:17:31.215289 0'0 2015-06-26 09:17:31.215289 0.1a 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:27.369325 0'0 15:8 [2] 2 [2,0] 2 0'0 2015-06-26 09:17:31.215286 0'0 2015-06-26 09:17:31.215286 0.19 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.758903 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215283 0'0 2015-06-26 09:17:31.215283 0.18 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:17.764594 0'0 10:8 [1] 1 [1,0] 1 0'0 2015-06-26 09:17:31.215281 0'0 2015-06-26 09:17:31.215281 0.17 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.758140 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215278 0'0 2015-06-26 09:17:31.215278 0.16 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.758084 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215275 0'0 2015-06-26 09:17:31.215275 0.15 0 0 0 0 0 0 0 0 active 2015-06-26 09:18:27.370381 0'0 15:6 [2] 2 [2,1] 2 0'0 2015-06-26 09:17:31.215272 0'0 2015-06-26 09:17:31.215272 0.14 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:17.847404 0'0 10:8 [1] 1 [1,0] 1 0'0 2015-06-26 09:17:31.215267 0'0 2015-06-26 09:17:31.215267 0.13 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:17.846842 0'0 10:8 [1] 1 [1,0] 1 0'0 2015-06-26 09:17:31.215264 0'0 2015-06-26 09:17:31.215264 0.12 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:17.846877 0'0 10:8 [1] 1 [1,0] 1 0'0 2015-06-26 09:17:31.215262 0'0 2015-06-26 09:17:31.215262 0.11 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.756301 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215259 0'0 2015-06-26 09:17:31.215259 0.10 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.756269 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215256 0'0 2015-06-26 09:17:31.215256 0.f 0 0 0 0 0 0 0 0 active 2015-06-26 09:18:27.366841 0'0 15:6 [2] 2 [2,1] 2 0'0 2015-06-26 09:17:31.215253 0'0 2015-06-26 09:17:31.215253 0.e 0 0 0 0 0 0 0 0 active 2015-06-26 09:18:27.365962 0'0 15:6 [2] 2 [2,1] 2 0'0 2015-06-26 09:17:31.215250 0'0 2015-06-26 09:17:31.215250 0.d 0 0 0 0 0 0 0 0 active+remapped 
2015-06-26 09:18:17.768589 0'0 10:8 [1] 1 [1,0] 1 0'0 2015-06-26 09:17:31.215247 0'0 2015-06-26 09:17:31.215247 0.c 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.756108 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215244 0'0 2015-06-26 09:17:31.215244 0.b 0 0 0 0 0 0 0 0 active 2015-06-26 09:18:27.372110 0'0 15:6 [2] 2 [2,1] 2 0'0 2015-06-26 09:17:31.215241 0'0 2015-06-26 09:17:31.215241 0.a 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.756028 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215238 0'0 2015-06-26 09:17:31.215238 0.9 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:17.767778 0'0 10:8 [1] 1 [1,0] 1 0'0 2015-06-26 09:17:31.215235 0'0 2015-06-26 09:17:31.215235 0.8 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:17.767963 0'0 10:8 [1] 1 [1,0] 1 0'0 2015-06-26 09:17:31.215232 0'0 2015-06-26 09:17:31.215232 0.7 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:17.767434 0'0 10:8 [1] 1 [1,0] 1 0'0 2015-06-26 09:17:31.215229 0'0 2015-06-26 09:17:31.215229 0.6 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.683564 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215227 0'0 2015-06-26 09:17:31.215227 0.5 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.683782 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215224 0'0 2015-06-26 09:17:31.215224 0.4 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.683467 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215221 0'0 2015-06-26 09:17:31.215221 0.3 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:27.370531 0'0 15:8 [2] 2 [2,0] 2 0'0 2015-06-26 09:17:31.215218 0'0 2015-06-26 09:17:31.215218 0.2 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.684159 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215215 0'0 2015-06-26 09:17:31.215215 0.1 0 0 0 0 0 0 0 0 active 2015-06-26 09:18:27.368246 0'0 15:6 [2] 2 [2,1] 2 0'0 2015-06-26 09:17:31.215212 0'0 2015-06-26 09:17:31.215212 0.0 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.683089 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215207 0'0 2015-06-26 09:17:31.215207 0.3f 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:27.368012 0'0 15:8 [2] 2 [2,0] 2 0'0 2015-06-26 09:17:31.215426 0'0 2015-06-26 09:17:31.215426 0.3e 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:27.367199 0'0 15:8 [2] 2 [2,0] 2 0'0 2015-06-26 09:17:31.215424 0'0 2015-06-26 09:17:31.215424 0.3d 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:17.719736 0'0 10:8 [1] 1 [1,0] 1 0'0 2015-06-26 09:17:31.215420 0'0 2015-06-26 09:17:31.215420 0.3c 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.682134 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215418 0'0 2015-06-26 09:17:31.215418 0.3b 0 0 0 0 0 0 0 0 active 2015-06-26 09:18:27.368153 0'0 15:6 [2] 2 [2,1] 2 0'0 2015-06-26 09:17:31.215415 0'0 2015-06-26 09:17:31.215415 0.3a 0 0 0 0 0 0 0 0 active 2015-06-26 09:18:27.372249 0'0 15:6 [2] 2 [2,1] 2 0'0 2015-06-26 09:17:31.215412 0'0 2015-06-26 09:17:31.215412 0.39 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.676675 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215409 0'0 2015-06-26 09:17:31.215409 0.38 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.677069 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215406 0'0 2015-06-26 09:17:31.215406 0.37 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:17.765897 0'0 10:8 [1] 1 [1,0] 1 0'0 2015-06-26 09:17:31.215403 0'0 2015-06-26 09:17:31.215403 0.36 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.676314 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215400 0'0 
2015-06-26 09:17:31.215400 0.35 0 0 0 0 0 0 0 0 active 2015-06-26 09:18:27.371676 0'0 15:6 [2] 2 [2,1] 2 0'0 2015-06-26 09:17:31.215398 0'0 2015-06-26 09:17:31.215398 0.34 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:17.765998 0'0 10:8 [1] 1 [1,0] 1 0'0 2015-06-26 09:17:31.215395 0'0 2015-06-26 09:17:31.215395 0.33 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.675119 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215392 0'0 2015-06-26 09:17:31.215392 0.32 0 0 0 0 0 0 0 0 active 2015-06-26 09:18:27.370631 0'0 15:6 [2] 2 [2,1] 2 0'0 2015-06-26 09:17:31.215389 0'0 2015-06-26 09:17:31.215389 0.31 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:17.764606 0'0 10:8 [1] 1 [1,0] 1 0'0 2015-06-26 09:17:31.215358 0'0 2015-06-26 09:17:31.215358 0.30 0 0 0 0 0 0 0 0 active 2015-06-26 09:18:27.368870 0'0 15:6 [2] 2 [2,1] 2 0'0 2015-06-26 09:17:31.215355 0'0 2015-06-26 09:17:31.215355 0.2f 0 0 0 0 0 0 0 0 active 2015-06-26 09:18:27.368309 0'0 15:6 [2] 2 [2,1] 2 0'0 2015-06-26 09:17:31.215352 0'0 2015-06-26 09:17:31.215352 0.2e 0 0 0 0 0 0 0 0 active 2015-06-26 09:18:27.370069 0'0 15:6 [2] 2 [2,1] 2 0'0 2015-06-26 09:17:31.215349 0'0 2015-06-26 09:17:31.215349 0.2d 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:27.369450 0'0 15:8 [2] 2 [2,0] 2 0'0 2015-06-26 09:17:31.215346 0'0 2015-06-26 09:17:31.215346 0.2c 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.682415 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215342 0'0 2015-06-26 09:17:31.215342 0.2b 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.682323 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215339 0'0 2015-06-26 09:17:31.215339 0.2a 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.682232 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215336 0'0 2015-06-26 09:17:31.215336 0.29 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-06-26 09:18:03.677627 0'0 5:4 [0] 0 [0] 0 0'0 2015-06-26 09:17:31.215333 0'0 2015-06-26 09:17:31.215333 0.28 0 0 0 0 0 0 0 0 active 2015-06-26 09:18:27.367862 0'0 15:6 [2] 2 [2,1] 2 0'0 2015-06-26 09:17:31.215330 0'0 2015-06-26 09:17:31.215330 0.27 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:17.721827 0'0 10:8 [1] 1 [1,0] 1 0'0 2015-06-26 09:17:31.215327 0'0 2015-06-26 09:17:31.215327 0.26 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:17.721741 0'0 10:8 [1] 1 [1,0] 1 0'0 2015-06-26 09:17:31.215324 0'0 2015-06-26 09:17:31.215324 0.25 0 0 0 0 0 0 0 0 active 2015-06-26 09:18:27.365162 0'0 15:6 [2] 2 [2,1] 2 0'0 2015-06-26 09:17:31.215322 0'0 2015-06-26 09:17:31.215322 0.24 0 0 0 0 0 0 0 0 active 2015-06-26 09:18:27.366513 0'0 15:6 [2] 2 [2,1] 2 0'0 2015-06-26 09:17:31.215319 0'0 2015-06-26 09:17:31.215319 0.23 0 0 0 0 0 0 0 0 active+remapped 2015-06-26 09:18:27.365571 0'0 15:8 [2] 2 [2,0] 2 0'0 2015-06-26 09:17:31.215314 0'0 2015-06-26 09:17:31.215314 pool 0 0 0 0 0 0 0 0 0 sum 0 0 0 0 0 0 0 0 osdstat kbused kbavail kb hb in hb out 0 34348 488112724 488147072 [1] [] 1 33860 488113212 488147072 [0,2] [] 2 33712 488113360 488147072 [0,1] [] sum 101920 1464339296 1464441216

What are the steps I should take to bring the cluster into a healthy state? Is now the time to run 'ceph osd pool set rbd pg_num 64'?

Thanks for your help!

Best,
Dave Durkee

From: Nick Fisk [mailto:nick@xxxxxxxxxx]

OK, some things to check/confirm:

- Make sure all your networking is OK; we have seen lots of problems related to jumbo frames not being correctly configured across nodes/switches. Test by pinging with large packets between hosts. This includes the separate public/cluster networks.
- Run ceph health detail – does it show anything interesting?
- Your pool is definitely a 2-way replication pool?
- Run a ceph pg dump; can you see a pattern amongst the PGs that have problems?

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Dave Durkee

Nick, I removed the failed OSDs, yet I am still in the same state:

ceph> status
    cluster b4419183-5320-4701-aae2-eb61e186b443
     health HEALTH_WARN
            32 pgs degraded
            64 pgs stale
            32 pgs stuck degraded
            246 pgs stuck inactive
            64 pgs stuck stale
            310 pgs stuck unclean
            32 pgs stuck undersized
            32 pgs undersized
            pool rbd pg_num 310 > pgp_num 64
     monmap e1: 1 mons at {mon=172.17.1.16:6789/0}
            election epoch 1, quorum 0 mon
     osdmap e82: 9 osds: 9 up, 9 in
      pgmap v196: 310 pgs, 1 pools, 0 bytes data, 0 objects
            303 MB used, 4189 GB / 4189 GB avail
                 246 creating
                  32 stale+active+undersized+degraded
                  32 stale+active+remapped

ceph> osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 4.04997 root default
-2 1.34999     host osd1
 2 0.45000         osd.2       up  1.00000          1.00000
 3 0.45000         osd.3       up  1.00000          1.00000
10 0.45000         osd.10      up  1.00000          1.00000
-3 1.34999     host osd2
 4 0.45000         osd.4       up  1.00000          1.00000
 5 0.45000         osd.5       up  1.00000          1.00000
 6 0.45000         osd.6       up  1.00000          1.00000
-4 1.34999     host osd3
 7 0.45000         osd.7       up  1.00000          1.00000
 8 0.45000         osd.8       up  1.00000          1.00000
 9 0.45000         osd.9       up  1.00000          1.00000

ceph> osd pool set rbd pgp_num 310
Error: 16 EBUSY Status: currently creating pgs, wait

ceph>

Dave Durkee

From: Nick Fisk [mailto:nick@xxxxxxxxxx]

Hi Dave,

It can't increase the pgp_num because the PGs are still being created. I can see you currently have 2 OSDs down; I'm not 100% certain this is the cause, but you might want to try to get them back online, or remove them if they no longer exist.

Nick

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Dave Durkee

ceph> osd pool set rbd pgp_num 310
Error: 16 EBUSY Status: currently creating pgs, wait

What does the above mean?

Dave Durkee

From: Nick Fisk [mailto:nick@xxxxxxxxxx]

Try: ceph osd pool set rbd pgp_num 310

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Dave Durkee

I just built a small lab cluster: 1 mon node, 3 OSD nodes (each with 3 Ceph disks and 1 OS/journal disk), an admin VM, and 3 client VMs.
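(For context, the OSD add step for a layout like this would have looked roughly like the rebuild commands quoted further up; the disk names and journal paths here are illustrative only:)

ceph-deploy disk zap osd1:sdc osd1:sdd osd1:sde
ceph-deploy osd create osd1:sdc:/journal/c osd1:sdd:/journal/d osd1:sde:/journal/e
# and the same again for osd2 and osd3 with their own disks and journal paths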
I followed the preflight and install instructions, and when I finished adding the OSDs I ran ceph status and got the following:

ceph> status
    cluster b4419183-5320-4701-aae2-eb61e186b443
     health HEALTH_WARN
            32 pgs degraded
            64 pgs stale
            32 pgs stuck degraded
            246 pgs stuck inactive
            64 pgs stuck stale
            310 pgs stuck unclean
            32 pgs stuck undersized
            32 pgs undersized
            pool rbd pg_num 310 > pgp_num 64
     monmap e1: 1 mons at {mon=172.17.1.16:6789/0}
            election epoch 2, quorum 0 mon
     osdmap e49: 11 osds: 9 up, 9 in
      pgmap v122: 310 pgs, 1 pools, 0 bytes data, 0 objects
            298 MB used, 4189 GB / 4189 GB avail
                 246 creating
                  32 stale+active+undersized+degraded
                  32 stale+active+remapped

ceph> health
HEALTH_WARN 32 pgs degraded; 64 pgs stale; 32 pgs stuck degraded; 246 pgs stuck inactive; 64 pgs stuck stale; 310 pgs stuck unclean; 32 pgs stuck undersized; 32 pgs undersized; pool rbd pg_num 310 > pgp_num 64

ceph> quorum_status
{"election_epoch":2,"quorum":[0],"quorum_names":["mon"],"quorum_leader_name":"mon","monmap":{"epoch":1,"fsid":"b4419183-5320-4701-aae2-eb61e186b443","modified":"0.000000","created":"0.000000","mons":[{"rank":0,"name":"mon","addr":"172.17.1.16:6789\/0"}]}}

ceph> mon_status
{"name":"mon","rank":0,"state":"leader","election_epoch":2,"quorum":[0],"outside_quorum":[],"extra_probe_peers":[],"sync_provider":[],"monmap":{"epoch":1,"fsid":"b4419183-5320-4701-aae2-eb61e186b443","modified":"0.000000","created":"0.000000","mons":[{"rank":0,"name":"mon","addr":"172.17.1.16:6789\/0"}]}}

ceph> osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 4.94997 root default
-2 2.24998     host osd1
 0 0.45000         osd.0     down        0          1.00000
 1 0.45000         osd.1     down        0          1.00000
 2 0.45000         osd.2       up  1.00000          1.00000
 3 0.45000         osd.3       up  1.00000          1.00000
10 0.45000         osd.10      up  1.00000          1.00000
-3 1.34999     host osd2
 4 0.45000         osd.4       up  1.00000          1.00000
 5 0.45000         osd.5       up  1.00000          1.00000
 6 0.45000         osd.6       up  1.00000          1.00000
-4 1.34999     host osd3
 7 0.45000         osd.7       up  1.00000          1.00000
 8 0.45000         osd.8       up  1.00000          1.00000
 9 0.45000         osd.9       up  1.00000          1.00000

Admin node:

[root@admin test-cluster]# cat ceph.conf
[global]
auth_service_required = cephx
filestore_xattr_use_omap = true
auth_client_required = cephx
auth_cluster_required = cephx
mon_host = 172.17.1.16
mon_initial_members = mon
fsid = b4419183-5320-4701-aae2-eb61e186b443
osd pool default size = 2
public network = 172.17.1.0/24
cluster network = 10.0.0.0/24

How do I diagnose and solve the cluster health issue? Do you need any additional information to help with the diagnostic process?

Thanks!!

Dave
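(A first-pass set of checks for warnings like these, along the lines Nick suggests above; the pool name rbd and the pgp_num value are the ones from this thread, everything else is standard CLI:)

ceph health detail              # per-PG detail behind the HEALTH_WARN summary
ceph osd tree                   # which OSDs are up/in and how they map to hosts
ceph pg dump_stuck inactive     # list the PGs stuck inactive
ceph pg dump_stuck unclean      # list the PGs stuck unclean
ceph osd pool get rbd size      # replication size of the pool
ceph osd pool get rbd pg_num    # PG count for the pool
ceph osd pool get rbd pgp_num   # PGs actually used for placement
# once PG creation has finished, pgp_num can be raised to match pg_num:
ceph osd pool set rbd pgp_num 310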