Resolved.
Apparently it took the OSD almost 2.5 hours to fully boot.
I had not seen this behavior before, but it eventually booted itself back into the CRUSH map.
Bookend log stamps below.
2016-10-07 21:33:39.241720 7f3d59a97800 0 set uid:gid to 64045:64045 (ceph:ceph)
2016-10-07 23:53:29.617038 7f3d59a97800 0 osd.0 4360 done with init, starting boot process

I had noticed a consistent read operation on the "down/out" OSD tied to that OSD's PID, which led me to believe it was doing something during all that time.
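For anyone else chasing this, per-process disk I/O is easy to sample against the daemon's PID to confirm the OSD is busy rather than hung. A rough sketch, assuming the sysstat and iotop packages are available and substituting the actual ceph-osd PID:

    # sample disk I/O for the ceph-osd process every 5 seconds (kB_rd/s, kB_wr/s)
    pidstat -d -p 2685 5

    # or watch it interactively
    sudo iotop -p 2685

Sustained reads with little or no write activity matched what I was seeing while the OSD worked through its boot.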
Also for reference, this was a 26% full 8TB disk.

    Filesystem      1K-blocks       Used  Available Use% Mounted on
    /dev/sda1      7806165996 1953556296 5852609700  26% /var/lib/ceph/osd/ceph-0

Reed
While attempting to adjust some of my recovery parameters, I restarted a single OSD in the cluster with the following syntax:

    sudo restart ceph-osd id=0
The OSD restarts without issue, and status shows it running with a PID.

    sudo status ceph-osd id=0
    ceph-osd (ceph/0) start/running, process 2685
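Since the process is up, the daemon's own view can be checked over its admin socket; a sketch, assuming the default socket path for osd.0:

    # query the daemon directly over its admin socket
    sudo ceph daemon osd.0 status

    # equivalent, pointing at the socket explicitly
    sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok status

The JSON this returns includes a state field, which should only read "active" once the OSD has finished booting.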
The OSD marked itself down cleanly.

2016-10-07 19:36:20.872883 mon.0 10.0.1.249:6789/0 1475867 : cluster [INF] osd.0 marked itself down
2016-10-07 19:36:21.590874 mon.0 10.0.1.249:6789/0 1475869 : cluster [INF] osdmap e4361: 16 osds: 15 up, 16 in
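To keep an eye on whether osd.0 ever comes back up, its line in the OSD tree can be polled from any admin node; a trivial sketch:

    # poll osd.0's up/down and in/out state every 5 seconds
    watch -n 5 "ceph osd tree | grep -w osd.0"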
The mons show this from one of many subsequent attempts to restart the OSD.

2016-10-07 19:58:16.222949 mon.1 [INF] from='client.? 10.0.1.25:0/324114592' entity='osd.0' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=node24", "root=default"], "id": 0, "weight": 7.2701}]: dispatch
2016-10-07 19:58:16.223626 mon.0 [INF] from='client.6557620 :/0' entity='osd.0' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=node24", "root=default"], "id": 0, "weight": 7.2701}]: dispatch
Grepping for osd.0 in the mon log shows this:

2016-10-07 19:36:20.872882 7fd39aced700 0 log_channel(cluster) log [INF] : osd.0 marked itself down
2016-10-07 19:36:27.698708 7fd39aced700 0 log_channel(audit) log [INF] : from='client.6554095 :/0' entity='osd.0' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=node24", "root=default"], "id": 0, "weight": 7.2701}]: dispatch
2016-10-07 19:36:27.706374 7fd39aced700 0 mon.core@0(leader).osd e4363 create-or-move crush item name 'osd.0' initial_weight 7.2701 at location {host=node24,root=default}
2016-10-07 19:39:30.515494 7fd39aced700 0 log_channel(audit) log [INF] : from='client.6554587 :/0' entity='osd.0' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=node24", "root=default"], "id": 0, "weight": 7.2701}]: dispatch
2016-10-07 19:39:30.515618 7fd39aced700 0 mon.core@0(leader).osd e4363 create-or-move crush item name 'osd.0' initial_weight 7.2701 at location {host=node24,root=default}
2016-10-07 19:41:59.714517 7fd39b4ee700 0 log_channel(cluster) log [INF] : osd.0 out (down for 338.148761)
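That last line is the part that hurts: the mon marked osd.0 out after ~338 seconds down, which triggers rebalancing even though the daemon may still be trying to start. The standard flag to avoid that during a planned restart is noout (not something set here, just the usual mitigation):

    # keep down OSDs from being marked out while the restart completes
    ceph osd set noout

    # ... restart the OSD and wait for it to rejoin ...

    # restore the normal down -> out behavior afterwards
    ceph osd unset noout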
Everything is running the latest Jewel release.

    ceph --version
    ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
Any help with this is extremely appreciated. Hoping someone has dealt with this before.

Reed Dier