If you wait longer, you should see the remaining OSDs get marked down. We
detect down OSDs in two ways:
1) OSDs heartbeat each other frequently and issue reports to the monitors
when the heartbeat responses take too long. (This is the main way.)
2) OSDs periodically send statistics to the monitors, and if these
statistics do not arrive for a *very* long time (roughly 15 minutes, by
default) the monitor will mark the OSD down.

It looks like when you restarted, the first OSD stayed down long enough to
be marked down by its peers within their timeframe (about 30 seconds), but
you killed the others quickly enough that they were never reported down by
each other.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Mar 31, 2014 at 12:44 PM, Dan Koren <dnk@xxxxxxxxxxxxx> wrote:
> On a 4 node cluster (admin + 3 mon/osd nodes) I see the following shortly
> after rebooting the cluster and waiting for a couple of minutes:
>
> root@rts23:~# ps -ef | grep ceph && ceph osd tree
> root      4183     1  0 12:09 ?        00:00:00 /usr/bin/ceph-mon --cluster=ceph -i rts23 -f
> root      5771  5640  0 12:30 pts/0    00:00:00 grep --color=auto ceph
> # id    weight  type name       up/down reweight
> -1      0.94    root default
> -2      0.31            host rts22
> 0       0.31                    osd.0   down    0
> -3      0.31            host rts21
> 1       0.31                    osd.1   up      1
> -4      0.32            host rts23
> 2       0.32                    osd.2   up      1
>
> It seems rather odd that ceph reports 2 OSDs up while ps does not show
> any OSD daemons running (ceph osd tree output is the same on all 4 nodes).
>
> ceph status shows:
>
> root@rts23:~# ceph status
>     cluster 6149cebd-b619-4709-9fec-00fd8bc210a3
>      health HEALTH_WARN 192 pgs degraded; 192 pgs stale; 192 pgs stuck
> stale; 192 pgs stuck unclean; recovery 10242/20484 objects degraded
> (50.000%); 2/2 in osds are down; clock skew detected on mon.rts23
>      monmap e1: 3 mons at
> {rts21=172.29.0.21:6789/0,rts22=172.29.0.22:6789/0,rts23=172.29.0.23:6789/0},
> election epoch 12, quorum 0,1,2 rts21,rts22,rts23
>      osdmap e25: 3 osds: 0 up, 2 in
>       pgmap v445: 192 pgs, 3 pools, 40960 MB data, 10242 objects
>             10305 MB used, 641 GB / 651 GB avail
>             10242/20484 objects degraded (50.000%)
>                  192 stale+active+degraded
>
> How can OSDs be "up" when no OSD daemons are running in the cluster?
>
> MTIA,
>
> dk
>
> Dan Koren
> Director of Software
> DATERA | 650.210.7910 | @dateranews
> dnk@xxxxxxxxx
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
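
The timeouts Greg describes correspond to a handful of Ceph config options.
Below is a minimal sketch of how to check them on a live daemon, assuming
the default admin socket paths under /var/run/ceph and using osd.0 and
mon.rts23 from the cluster above as examples; the option names come from
the Ceph configuration reference, but defaults vary by release, so verify
the values on your own cluster rather than relying on the ones quoted here.

    # Peer heartbeat settings behind the ~30 second detection path:
    # how often OSDs ping each other, and how long they wait before
    # reporting a peer as failed.
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show \
        | grep -E 'osd_heartbeat_(interval|grace)'

    # Monitor-side fallback behind the ~15 minute path: how long a monitor
    # waits for an OSD's periodic stat reports before marking it down itself.
    ceph --admin-daemon /var/run/ceph/ceph-mon.rts23.asok config show \
        | grep mon_osd_report_timeout

If every OSD is killed at roughly the same time, as in the report above,
no surviving peers are left to send failure reports, so the cluster has to
fall back to the slower monitor-side timeout before the remaining OSDs are
marked down.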