OSDs not coming up on one host

I have a cluster of 3 servers (recently updated from 0.80.5 to 0.80.9), each running 4-6 OSDs on single disks, each journaled to its own partition on an SSD, with 3 mons on separate hosts. Recently, I started taking the hosts down one at a time to move disks between controllers and add extra disk capacity before bringing them back into the cluster. This worked fine on the first host, which is now back in with 6 OSDs. However, on doing the same with the second host, the OSDs are (re-)created properly but never join the cluster in an 'up' state; they just stay 'down'. The OSD processes themselves show a status of 'booting' that they never leave.
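
For reference, this is roughly how I'm checking each stuck OSD on ceph02 (the admin socket path is the stock /var/run/ceph one; adjust if yours differs). ps shows all six ceph-osd processes running, and the admin socket reports "state": "booting" for each of them (full output for osd.15 is further down):

# ps aux | grep ceph-osd
# ceph daemon osd.15 status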

$ ceph osd tree
# id    weight    type name    up/down    reweight
-1    16    root default
-2    6        host ceph01
0    1            osd.0    up    1   
3    1            osd.3    up    1   
1    1            osd.1    up    1   
2    1            osd.2    up    1   
12    1            osd.12    up    1   
13    1            osd.13    up    1   
-3    6        host ceph02
7    1            osd.7    down    0   
4    1            osd.4    down    0   
6    1            osd.6    down    0   
14    1            osd.14    down    0   
5    1            osd.5    down    0   
15    1            osd.15    down    0   
-4    4        host ceph03
8    1            osd.8    up    1   
9    1            osd.9    up    1   
10    1            osd.10    up    1   
11    1            osd.11    up    1   

In the log for an OSD that stays down, I see the normal startup messages, then just the following:

2015-04-08 14:22:38.324037 7f683b90c780  0 osd.15 3808 done with init, starting boot process
2015-04-08 14:26:23.614375 7f68235fe700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f68235fe700' had timed out after 4
2015-04-08 14:26:23.614397 7f6824e01700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f6824e01700' had timed out after 4
2015-04-08 14:26:23.614387 7f6824600700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f6824600700' had timed out after 4
2015-04-08 14:26:23.614404 7f6825602700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f6825602700' had timed out after 4
2015-04-08 14:26:23.614410 7f6823dff700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f6823dff700' had timed out after 4
2015-04-08 14:26:23.614431 7f682f55e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f682f55e700' had timed out after 4
2015-04-08 14:26:23.615003 7f6837d6f700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f682fd5f700' had timed out after 4
2015-04-08 14:26:23.615299 7f682fd5f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f682fd5f700' had timed out after 4
2015-04-08 14:26:53.066311 7f682f55e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f682f55e700' had timed out after 4
2015-04-08 14:26:53.066777 7f682fd5f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f682fd5f700' had timed out after 4
2015-04-08 14:28:23.237330 7f6823dff700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f6823dff700' had timed out after 4
2015-04-08 14:28:23.237646 7f6824600700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f6824600700' had timed out after 4
2015-04-08 14:28:23.237690 7f68235fe700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f68235fe700' had timed out after 4
2015-04-08 14:28:23.238010 7f6824e01700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f6824e01700' had timed out after 4
2015-04-08 14:28:23.238051 7f6825602700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f6825602700' had timed out after 4
2015-04-08 14:29:53.469859 7f6824e01700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f6824e01700' had timed out after 4
2015-04-08 14:29:53.469882 7f68235fe700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f68235fe700' had timed out after 4
2015-04-08 14:29:53.469895 7f6825602700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f6825602700' had timed out after 4
2015-04-08 14:29:53.469900 7f6823dff700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f6823dff700' had timed out after 4
2015-04-08 14:29:53.469947 7f6824600700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f6824600700' had timed out after 4
2015-04-08 14:31:45.141384 7f682f55e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f682f55e700' had timed out after 4
2015-04-08 14:31:45.141425 7f6824600700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f6824600700' had timed out after 4
2015-04-08 14:31:45.141603 7f6837d6f700  1 heartbeat_map is_healthy 'OSD::command_tp thread 0x7f68235fe700' had timed out after 4
2015-04-08 14:31:45.141614 7f6837d6f700  1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7f6823dff700' had timed out after 4
2015-04-08 14:31:45.141616 7f6837d6f700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f6824e01700' had timed out after 4
2015-04-08 14:31:45.141619 7f6837d6f700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f6825602700' had timed out after 4
2015-04-08 14:31:45.141621 7f6837d6f700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f682fd5f700' had timed out after 4
2015-04-08 14:31:45.141630 7f68235fe700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f68235fe700' had timed out after 4
2015-04-08 14:31:45.141661 7f682fd5f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f682fd5f700' had timed out after 4
2015-04-08 14:31:45.141925 7f6825602700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f6825602700' had timed out after 4
2015-04-08 14:31:45.141966 7f6824e01700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f6824e01700' had timed out after 4
2015-04-08 14:31:45.141987 7f6823dff700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f6823dff700' had timed out after 4
2015-04-08 14:38:23.224172 7f6825602700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f6825602700' had timed out after 4
2015-04-08 14:38:23.224379 7f6824e01700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f6824e01700' had timed out after 4
2015-04-08 14:38:23.224410 7f68235fe700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f68235fe700' had timed out after 4
2015-04-08 14:38:23.225392 7f6824600700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f6824600700' had timed out after 4
2015-04-08 14:38:23.225439 7f6823dff700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f6823dff700' had timed out after 4
2015-04-08 14:39:53.321219 7f6824e01700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f6824e01700' had timed out after 4
2015-04-08 14:39:53.321372 7f68235fe700  1 heartbeat_map reset_timeout 'OSD::command_tp thread 0x7f68235fe700' had timed out after 4
2015-04-08 14:39:53.321522 7f6824600700  1 heartbeat_map reset_timeout 'OSD::recovery_tp thread 0x7f6824600700' had timed out after 4
2015-04-08 14:39:53.321568 7f6825602700  1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f6825602700' had timed out after 4
2015-04-08 14:39:53.321593 7f6823dff700  1 heartbeat_map reset_timeout 'OSD::disk_tp thread 0x7f6823dff700' had timed out after 4
2015-04-08 14:39:53.321586 7f682fd5f700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f682fd5f700' had timed out after 4
2015-04-08 14:39:53.321661 7f6837d6f700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f682f55e700' had timed out after 4
2015-04-08 14:39:53.321688 7f682f55e700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f682f55e700' had timed out after

There is no fault with the hardware, and even waiting for hours does not let them rejoin the cluster, despite their having the latest version of the osdmap (epochs below). The entire cluster is running the same version (0.80.9).
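
(For completeness, the version check is nothing fancier than running the following on each host; everything reports 0.80.9.)

# ceph --version
# ceph-osd --version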

# ceph osd stat
     osdmap e3814: 16 osds: 10 up, 10 in
# ceph daemon osd.15 status
{ "cluster_fsid": "*****",
  "osd_fsid": "*****",
  "whoami": 15,
  "state": "booting",
  "oldest_map": 2527,
  "newest_map": 3814,
  "num_pgs": 0}

They remain in this booting state, as far as I can tell, indefinitely. I have checked on the mons, and they are seeing traffic to and from the host with the down OSDs. There are no obvious errors in the logs on any of the OSD or mon hosts.
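
If more detail would help, I can crank up logging on one of the stuck OSDs via its admin socket and post the result, and/or dump its current heartbeat/timeout settings, e.g.:

# ceph daemon osd.15 config set debug_osd 20
# ceph daemon osd.15 config set debug_ms 1
# ceph daemon osd.15 config show | grep -E 'timeout|heartbeat'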

Any ideas?

Thanks