Resolved.
Apparently it took the OSD almost 2.5 hours to fully boot.
I had not seen this behavior before, but it eventually booted itself back into the CRUSH map.
Bookend log stamps below.
2016-10-07 21:33:39.241720 7f3d59a97800 0 set uid:gid to 64045:64045 (ceph:ceph)
2016-10-07 23:53:29.617038 7f3d59a97800 0 osd.0 4360 done with init, starting boot process

I had noticed a consistent read operation on the "down/out" OSD tied to that OSD's PID, which led me to believe it was doing something during all that time.
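For anyone else chasing this, per-process disk I/O is easy to sample against the daemon's PID to confirm the OSD is busy rather than hung. A rough sketch, assuming the sysstat and iotop packages are available and substituting the actual ceph-osd PID:

    # sample disk I/O for the ceph-osd process every 5 seconds (kB_rd/s, kB_wr/s)
    pidstat -d -p 2685 5

    # or watch it interactively
    sudo iotop -p 2685

Sustained reads with little or no write activity matched what I was seeing while the OSD worked through its boot.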
Also for reference, this was a 26% full 8TB disk.

    Filesystem      1K-blocks       Used  Available Use% Mounted on
    /dev/sda1      7806165996 1953556296 5852609700  26% /var/lib/ceph/osd/ceph-0

Reed
While attempting to adjust some of my recovery parameters, I restarted a single OSD in the cluster with the following syntax:

    sudo restart ceph-osd id=0
The OSD restarts without issue, and status shows it running with a PID.

    sudo status ceph-osd id=0
    ceph-osd (ceph/0) start/running, process 2685
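Since the process is up, the daemon's own view can be checked over its admin socket; a sketch, assuming the default socket path for osd.0:

    # query the daemon directly over its admin socket
    sudo ceph daemon osd.0 status

    # equivalent, pointing at the socket explicitly
    sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok status

The JSON this returns includes a state field, which should only read "active" once the OSD has finished booting.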
The OSD marked itself down cleanly.

2016-10-07 19:36:20.872883 mon.0 10.0.1.249:6789/0 1475867 : cluster [INF] osd.0 marked itself down
2016-10-07 19:36:21.590874 mon.0 10.0.1.249:6789/0 1475869 : cluster [INF] osdmap e4361: 16 osds: 15 up, 16 in
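To keep an eye on whether osd.0 ever comes back up, its line in the OSD tree can be polled from any admin node; a trivial sketch:

    # poll osd.0's up/down and in/out state every 5 seconds
    watch -n 5 "ceph osd tree | grep -w osd.0"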
The mons show this from one of many subsequent attempts to restart the OSD.

2016-10-07 19:58:16.222949 mon.1 [INF] from='client.? 10.0.1.25:0/324114592' entity='osd.0' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=node24", "root=default"], "id": 0, "weight": 7.2701}]: dispatch
2016-10-07 19:58:16.223626 mon.0 [INF] from='client.6557620 :/0' entity='osd.0' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=node24", "root=default"], "id": 0, "weight": 7.2701}]: dispatch
Grepping for osd.0 in the mon log shows this:

2016-10-07 19:36:20.872882 7fd39aced700 0 log_channel(cluster) log [INF] : osd.0 marked itself down
2016-10-07 19:36:27.698708 7fd39aced700 0 log_channel(audit) log [INF] : from='client.6554095 :/0' entity='osd.0' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=node24", "root=default"], "id": 0, "weight": 7.2701}]: dispatch
2016-10-07 19:36:27.706374 7fd39aced700 0 mon.core@0(leader).osd e4363 create-or-move crush item name 'osd.0' initial_weight 7.2701 at location {host=node24,root=default}
2016-10-07 19:39:30.515494 7fd39aced700 0 log_channel(audit) log [INF] : from='client.6554587 :/0' entity='osd.0' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=node24", "root=default"], "id": 0, "weight": 7.2701}]: dispatch
2016-10-07 19:39:30.515618 7fd39aced700 0 mon.core@0(leader).osd e4363 create-or-move crush item name 'osd.0' initial_weight 7.2701 at location {host=node24,root=default}
2016-10-07 19:41:59.714517 7fd39b4ee700 0 log_channel(cluster) log [INF] : osd.0 out (down for 338.148761)
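That last line is the part that hurts: the mon marked osd.0 out after ~338 seconds down, which triggers rebalancing even though the daemon may still be trying to start. The standard flag to avoid that during a planned restart is noout (not something set here, just the usual mitigation):

    # keep down OSDs from being marked out while the restart completes
    ceph osd set noout

    # ... restart the OSD and wait for it to rejoin ...

    # restore the normal down -> out behavior afterwards
    ceph osd unset noout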
Everything is running the latest Jewel release.

    ceph --version
    ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
Any help with this is extremely appreciated. Hoping someone has dealt with this before.

Reed Dier