Re: can't get cluster to become healthy. "stale+undersized+degraded+peered"

Hello Stefan...

Those 64 PGs belong to the default rbd pool, which is created automatically. Can you please give us the output of

    # ceph osd pool ls detail
    # ceph pg dump_stuck

The degraded / stale status means that the PGs cannot be replicated according to your policies.

My guess is that you simply have too few OSDs for the number of replicas you are requesting.
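
If that turns out to be the case, here is a rough sketch of what I would look at once you have that output (assuming the only pool is the default 'rbd' pool and that you want one replica per host; I can't confirm either without the 'ls detail' output):

    # ceph osd pool get rbd size
    # ceph osd pool get rbd min_size

and, if size is set higher than the three hosts you have can satisfy, either add OSD hosts or lower it, e.g.

    # ceph osd pool set rbd size 3
    # ceph osd pool set rbd min_size 2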

Cheers
G.



On 09/17/2015 02:59 AM, Stefan Eriksson wrote:
I have a completely new cluster for testing. It's three servers, which are all monitors and OSD hosts, and they each have one disk.
The issue is that ceph status shows: 64 stale+undersized+degraded+peered

health:

     health HEALTH_WARN
            clock skew detected on mon.ceph01-osd03
            64 pgs degraded
            64 pgs stale
            64 pgs stuck degraded
            64 pgs stuck inactive
            64 pgs stuck stale
            64 pgs stuck unclean
            64 pgs stuck undersized
            64 pgs undersized
            too few PGs per OSD (21 < min 30)
            Monitor clock skew detected
monmap e1: 3 mons at {ceph01-osd01=192.1.41.51:6789/0,ceph01-osd02=192.1.41.52:6789/0,ceph01-osd03=192.1.41.53:6789/0} election epoch 82, quorum 0,1,2 ceph01-osd01,ceph01-osd02,ceph01-osd03
     osdmap e36: 3 osds: 3 up, 3 in
      pgmap v85: 64 pgs, 1 pools, 0 bytes data, 0 objects
            101352 kB used, 8365 GB / 8365 GB avail
                  64 stale+undersized+degraded+peered


ceph osd tree shows:
ID WEIGHT  TYPE NAME             UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 8.15996 root default
-2 2.71999     host ceph01-osd01
 0 2.71999         osd.0              up  1.00000          1.00000
-3 2.71999     host ceph01-osd02
 1 2.71999         osd.1              up  1.00000          1.00000
-4 2.71999     host ceph01-osd03
 2 2.71999         osd.2              up  1.00000          1.00000

Here is my crushmap:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host ceph01-osd01 {
        id -2           # do not change unnecessarily
        # weight 2.720
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 2.720
}
host ceph01-osd02 {
        id -3           # do not change unnecessarily
        # weight 2.720
        alg straw
        hash 0  # rjenkins1
        item osd.1 weight 2.720
}
host ceph01-osd03 {
        id -4           # do not change unnecessarily
        # weight 2.720
        alg straw
        hash 0  # rjenkins1
        item osd.2 weight 2.720
}
root default {
        id -1           # do not change unnecessarily
        # weight 8.160
        alg straw
        hash 0  # rjenkins1
        item ceph01-osd01 weight 2.720
        item ceph01-osd02 weight 2.720
        item ceph01-osd03 weight 2.720
}

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

# end crush map

And here is the ceph.conf, which is shared among all nodes:

ceph.conf
[global]
fsid = b9043917-5f65-98d5-8624-ee12ff32a5ea
public_network = 192.1.41.0/24
cluster_network = 192.168.0.0/24
mon_initial_members = ceph01-osd01, ceph01-osd02, ceph01-osd03
mon_host = 192.1.41.51,192.1.41.52,192.1.41.53
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd pool default pg num = 512
osd pool default pgp num = 512
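
(Side note: as far as I understand, the "osd pool default pg num" settings only apply to pools created after they are set, so they don't change the existing rbd pool, which still shows 64 PGs, hence the "too few PGs per OSD" warning above. I assume something along these lines would raise it for the existing pool:

    # ceph osd pool set rbd pg_num 128
    # ceph osd pool set rbd pgp_num 128

but the stale/degraded state seems like the more fundamental problem.)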

The logs don't say much; the only active log which adds something is:

mon.ceph01-osd01@0(leader).data_health(82) update_stats avail 88% total 9990 MB, used 1170 MB, avail 8819 MB
mon.ceph01-osd02@1(peon).data_health(82) update_stats avail 88% total 9990 MB, used 1171 MB, avail 8818 MB
mon.ceph01-osd03@2(peon).data_health(82) update_stats avail 88% total 9990 MB, used 1172 MB, avail 8817 MB

Does anyone have any thoughts on what might be wrong? Or is there other info I can provide to help track down what it might be?

Thanks!

--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


