Re: Issues with a fresh cluster and HEALTH_WARN

Well,

I had a closer look at the logs. Although the OSDs were listed as up and in to begin with, shortly after I sent this email the two on one of the hosts went down. It turned out that their data directories weren't mounted for some reason. After re-mounting them and restarting the services, everything came back online and I now have a healthy cluster.
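
For anyone who hits the same thing, this is roughly what I ran to bring them back (the device names and OSD IDs are from my setup, so substitute your own):

    # check which OSD data directories are actually mounted
    mount | grep /var/lib/ceph/osd

    # re-mount the missing OSD partitions (devices here are just my layout)
    mount /dev/sdb1 /var/lib/ceph/osd/ceph-0
    mount /dev/sdc1 /var/lib/ceph/osd/ceph-1

    # restart the OSD daemons and confirm they rejoin
    service ceph start osd.0
    service ceph start osd.1
    ceph osd tree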

Time is being synced by ntpd on the servers, so I'm not sure what's going on there.
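
For what it's worth, this is roughly how I checked that ntpd really is synced on each node (the NTP server in the last block is just an example):

    # show peers and current offset (the offset column is in milliseconds)
    ntpq -p

    # if a node has drifted badly, step the clock once and restart ntpd
    service ntpd stop
    ntpdate pool.ntp.org   # substitute your own NTP server
    service ntpd start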

Cheers,
Josh

On 06/07/2013 10:47 AM, Jeff Bailey wrote:
You need to fix your clocks (usually with ntp). According to the log
message they can be off by at most 50ms, and yours seem to be about 85ms off.
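
If the clocks really are in sync and the warning is only transient, the threshold the monitors complain about can also be raised in ceph.conf; the default is 0.05s, and something like this (0.1 is just an example value) would loosen it:

    [mon]
        mon clock drift allowed = 0.1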


On 6/6/2013 8:40 PM, Joshua Mesilane wrote:
Hi,

I'm currently evaluating Ceph as a solution for some HA storage that
we're looking at. To test it I have 3 servers, each with two disks to be
used for OSDs (journals on the same disk as the OSD). I've deployed the
cluster with 3 mons (one on each server), 6 OSDs (2 on each server),
and 3 MDSes (1 on each server).

I've built the cluster using ceph-deploy checked out from git on my
local workstation (Fedora 15), and the servers themselves are running
CentOS 6.4.

First note: it looks like the ceph-deploy tool, when you run
"ceph-deploy osd prepare host:device", is actually also activating the
OSD when it's done, instead of waiting for you to run the ceph-deploy
osd activate command.
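
In other words, I expected to have to run both steps myself, roughly like this (host and device names are just placeholders from my setup):

    ceph-deploy osd prepare sv-dev-ha02:/dev/sdb
    ceph-deploy osd activate sv-dev-ha02:/dev/sdb1

but the OSD was already up and in after the prepare step.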

Question: Is ceph-deploy supposed to write out [mon] and [osd]
sections to the ceph.conf configuration file? The file it generated
contains only a [global] section; there are no other sections at all.
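
For reference, what it contains is just a handful of [global] lines, roughly of this shape (fsid redacted; the first hostname is my guess at the naming pattern):

    [global]
    fsid = <uuid>
    mon_initial_members = sv-dev-ha01, sv-dev-ha02, sv-dev-ha03
    mon_host = 10.20.100.90,10.20.100.91,10.20.100.92
    auth_supported = cephx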

Question: Once I got all 6 of my OSDs online, I started getting the
following health warning:

"health HEALTH_WARN 91 pgs degraded; 192 pgs stuck unclean; clock skew
detected on mon.sv-dev-ha02, mon.sv-dev-ha03"

ceph health detail gives me (truncated for readability):

[root@sv-dev-ha02 ~]# ceph health detail
HEALTH_WARN 91 pgs degraded; 192 pgs stale; 192 pgs stuck unclean; 2/6
in osds are down; clock skew detected on mon.sv-dev-ha02, mon.sv-dev-ha03
pg 2.3d is stuck unclean since forever, current state
stale+active+remapped, last acting [1,0]
pg 1.3e is stuck unclean since forever, current state
stale+active+remapped, last acting [1,0]
.... (Lots more lines like this) ...
pg 1.1 is stuck unclean since forever, current state
stale+active+remapped, last acting [1,0]
pg 0.0 is stuck unclean since forever, current state
stale+active+degraded, last acting [0]
pg 0.3f is stale+active+remapped, acting [1,0]
pg 1.3e is stale+active+remapped, acting [1,0]
... (Lots more lines like this) ...
pg 1.1 is stale+active+remapped, acting [1,0]
pg 2.2 is stale+active+remapped, acting [1,0]
osd.0 is down since epoch 25, last address 10.20.100.90:6800/3994
osd.1 is down since epoch 25, last address 10.20.100.90:6803/4758
mon.sv-dev-ha02 addr 10.20.100.91:6789/0 clock skew 0.0858782s > max 0.05s (latency 0.00546217s)
mon.sv-dev-ha03 addr 10.20.100.92:6789/0 clock skew 0.0852838s > max 0.05s (latency 0.00533693s)

Any help on how to start troubleshooting this issue would be appreciated.

Cheers,


--
josh mesilane
senior systems administrator

luma pictures
level 2
256 clarendon street
0416 039 082 m
lumapictures.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



