If you wait longer, you should see the remaining OSDs get marked down. We
detect down OSDs in two ways:
1) OSDs heartbeat each other frequently and issue reports to the monitors
when the heartbeat responses take too long. (This is the main way.)
2) OSDs periodically send statistics to the monitors, and if these
statistics do not arrive for a *very* long time (roughly 15 minutes, by
default) the monitor will mark the OSD down.

It looks like when you restarted, the first OSD stayed down long enough to
be marked down by its peers within their timeframe (about 30 seconds), but
you killed the others quickly enough that they were never reported down by
each other.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Mar 31, 2014 at 12:44 PM, Dan Koren <dnk@xxxxxxxxxxxxx> wrote:
> On a 4 node cluster (admin + 3 mon/osd nodes) I see the following shortly
> after rebooting the cluster and waiting for a couple of minutes:
>
> root@rts23:~# ps -ef | grep ceph && ceph osd tree
> root      4183     1  0 12:09 ?        00:00:00 /usr/bin/ceph-mon --cluster=ceph -i rts23 -f
> root      5771  5640  0 12:30 pts/0    00:00:00 grep --color=auto ceph
> # id    weight  type name       up/down reweight
> -1      0.94    root default
> -2      0.31            host rts22
> 0       0.31                    osd.0   down    0
> -3      0.31            host rts21
> 1       0.31                    osd.1   up      1
> -4      0.32            host rts23
> 2       0.32                    osd.2   up      1
>
> It seems rather odd that ceph reports 2 OSDs up while ps does not show
> any OSD daemons running (ceph osd tree output is the same on all 4 nodes).
>
> ceph status shows:
>
> root@rts23:~# ceph status
>     cluster 6149cebd-b619-4709-9fec-00fd8bc210a3
>      health HEALTH_WARN 192 pgs degraded; 192 pgs stale; 192 pgs stuck
> stale; 192 pgs stuck unclean; recovery 10242/20484 objects degraded
> (50.000%); 2/2 in osds are down; clock skew detected on mon.rts23
>      monmap e1: 3 mons at
> {rts21=172.29.0.21:6789/0,rts22=172.29.0.22:6789/0,rts23=172.29.0.23:6789/0},
> election epoch 12, quorum 0,1,2 rts21,rts22,rts23
>      osdmap e25: 3 osds: 0 up, 2 in
>       pgmap v445: 192 pgs, 3 pools, 40960 MB data, 10242 objects
>             10305 MB used, 641 GB / 651 GB avail
>             10242/20484 objects degraded (50.000%)
>                  192 stale+active+degraded
>
> How can OSDs be "up" when no OSD daemons are running in the cluster?
>
> MTIA,
>
> dk
>
> Dan Koren
> Director of Software
> DATERA | 650.210.7910 | @dateranews
> dnk@xxxxxxxxx
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
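
The timeouts Greg describes correspond to a handful of Ceph config options.
Below is a minimal sketch of how to check them on a live daemon, assuming
the default admin socket paths under /var/run/ceph and using osd.0 and
mon.rts23 from the cluster above as examples; the option names come from
the Ceph configuration reference, but defaults vary by release, so verify
the values on your own cluster rather than relying on the ones quoted here.

    # Peer heartbeat settings behind the ~30 second detection path:
    # how often OSDs ping each other, and how long they wait before
    # reporting a peer as failed.
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show \
        | grep -E 'osd_heartbeat_(interval|grace)'

    # Monitor-side fallback behind the ~15 minute path: how long a monitor
    # waits for an OSD's periodic stat reports before marking it down itself.
    ceph --admin-daemon /var/run/ceph/ceph-mon.rts23.asok config show \
        | grep mon_osd_report_timeout

If every OSD is killed at roughly the same time, as in the report above,
no surviving peers are left to send failure reports, so the cluster has to
fall back to the slower monitor-side timeout before the remaining OSDs are
marked down.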