Re: OSD mystery

Hi Greg,
Thanks for the prompt response.
Sure enough, I do see that all the OSDs are now down.
However, I do not understand the sentence about killing
the OSDs: this was an OS-level reboot of the entire
cluster, with no Ceph commands issued either before or
after the restart. Doesn't Ceph recover transparently to
the same state it was in before the cluster rebooted?
Thx,
dk

On Mon, Mar 31, 2014 at 12:47 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
If you wait longer, you should see the remaining OSDs get marked down.
We detect down OSDs in two ways:
1) OSDs heartbeat each other frequently and issue reports when the
heartbeat responses take too long. (This is the main way.)
2) OSDs periodically send statistics to the monitors, and if these
statistics do not arrive for a *very* long time (roughly 15 minutes,
by default) the monitor will mark the OSD down.
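For reference, both windows correspond to config options. A minimal
sketch of the relevant ceph.conf settings, assuming the stock
defaults (these are the documented defaults, not values read from
your cluster):

    [osd]
    # How often an OSD pings its heartbeat peers (seconds).
    osd heartbeat interval = 6
    # How long peers wait for a reply before reporting the
    # OSD down to the monitors (seconds).
    osd heartbeat grace = 20

    [mon]
    # How long the monitors wait for OSD stat reports before
    # marking the OSD down themselves (900 s = 15 minutes).
    mon osd report timeout = 900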

It looks like when you restarted, the first OSD went down early
enough that its peers marked it down within their window (about
30 seconds), but the remaining OSDs were killed so close together
that neither was left running long enough to report the other down.
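
For what it's worth, on a planned reboot a couple of commands are
commonly used to manage exactly this (a generic sketch, not specific
to your cluster; osd.1 below is just an example id):

    # Before the reboot: stop the cluster from marking OSDs
    # out and rebalancing while the nodes are away.
    ceph osd set noout

    # ... reboot the nodes ...

    # After the OSDs rejoin, clear the flag.
    ceph osd unset noout

    # If an entry in `ceph osd tree` is still stale, an OSD
    # can be marked down in the map by hand:
    ceph osd down 1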

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Mar 31, 2014 at 12:44 PM, Dan Koren <dnk@xxxxxxxxxxxxx> wrote:
> On a 4-node cluster (admin + 3 mon/OSD nodes), I see the following shortly
> after rebooting the cluster and waiting a couple of minutes:
>
> root@rts23:~# ps -ef | grep ceph && ceph osd tree
> root       4183      1  0 12:09 ?        00:00:00 /usr/bin/ceph-mon --cluster=ceph -i rts23 -f
> root       5771   5640  0 12:30 pts/0    00:00:00 grep --color=auto ceph
> # id    weight  type name       up/down reweight
> -1      0.94    root default
> -2      0.31            host rts22
> 0       0.31                    osd.0   down    0
> -3      0.31            host rts21
> 1       0.31                    osd.1   up      1
> -4      0.32            host rts23
> 2       0.32                    osd.2   up      1
>
>
> It seems rather odd that Ceph reports 2 OSDs up while ps does not show
> any OSD daemons running (the ceph osd tree output is the same on all 4 nodes).
>
> ceph status shows:
>
> root@rts23:~# ceph status
>     cluster 6149cebd-b619-4709-9fec-00fd8bc210a3
>      health HEALTH_WARN 192 pgs degraded; 192 pgs stale; 192 pgs stuck stale;
>             192 pgs stuck unclean; recovery 10242/20484 objects degraded (50.000%);
>             2/2 in osds are down; clock skew detected on mon.rts23
>      monmap e1: 3 mons at {rts21=172.29.0.21:6789/0,rts22=172.29.0.22:6789/0,rts23=172.29.0.23:6789/0},
>             election epoch 12, quorum 0,1,2 rts21,rts22,rts23
>      osdmap e25: 3 osds: 0 up, 2 in
>       pgmap v445: 192 pgs, 3 pools, 40960 MB data, 10242 objects
>             10305 MB used, 641 GB / 651 GB avail
>             10242/20484 objects degraded (50.000%)
>                  192 stale+active+degraded
>
>
> How can OSDs be "up" when no OSD daemons are running in the cluster?
>
> MTIA,
>
> dk
>
> Dan Koren
> Director of Software
> DATERA | 650.210.7910 | @dateranews
> dnk@xxxxxxxxx
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
