Still CRUSH problems with 0.94.1?

Hi all,

might there be a problem with the CRUSH function during a 'from scratch' installation of 0.94.1-0?

I have tested this many times, with both ceph-deploy-1.5.22-0 and ceph-deploy-1.5.23-0, on RHEL 7.

Each time, the new cluster ends up in a weird state I have never seen with my previously installed versions (0.94, 0.87.1):

- I have seen behaviour that is perhaps linked to ceph-deploy-1.5.23-0: one or more monitors unable to form the cluster (with respawning 'python /usr/sbin/ceph-create-keys' messages). But I think that is a separate part of the issue.
- The main issue is visible as PG health warnings as soon as the cluster is formed enough to answer a 'ceph -s'.
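For reference, the deployment follows the usual ceph-deploy workflow; the sequence below is only a sketch of the kind of commands used (only 'helga' is a real hostname, the OSD host and disk names are placeholders):

ceph-deploy new helga
ceph-deploy install helga osdhost1 osdhost2 osdhost3
ceph-deploy mon create-initial
ceph-deploy osd create osdhost1:sdb osdhost1:sdc    # and so on for the 60 disks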

- Here is a freshly installed, almost empty cluster with a single monitor:

ROOT > ceph -s
   cluster e581ab43-d0f5-4ea8-811f-94c8df16d044
    health HEALTH_WARN
           2 pgs degraded
           14 pgs peering
           4 pgs stale
           2 pgs stuck degraded
           25 pgs stuck inactive
           4 pgs stuck stale
           27 pgs stuck unclean
           2 pgs stuck undersized
           2 pgs undersized
           too few PGs per OSD (3 < min 30)
    monmap e1: 1 mons at {helga=10.10.10.64:6789/0}
           election epoch 2, quorum 0 helga
    osdmap e398: 60 osds: 60 up, 60 in; 2 remapped pgs
     pgmap v1553: 64 pgs, 1 pools, 0 bytes data, 0 objects
           2829 MB used, 218 TB / 218 TB avail
                 37 active+clean
                 12 peering
                 11 activating
                  2 stale+active+undersized+degraded
                  2 stale+remapped+peering
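
As an aside, the 'too few PGs per OSD (3 < min 30)' line is simple arithmetic: 64 PGs * 3 replicas / 60 OSDs is about 3, below the default mon_pg_warn_min_per_osd of 30. Assuming the single pool is the default 'rbd' pool with size 3, raising pg_num/pgp_num would normally clear that particular warning, for example:

ceph osd pool set rbd pg_num 2048
ceph osd pool set rbd pgp_num 2048

But that is unrelated to the stuck/peering PGs, which are the real problem here.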

Over time, the number of problem PGs keeps growing, and it literally explodes if we put objects on the cluster.

- A 'ceph health detail' shows, for example, entries like this one:
pg 0.22 is stuck inactive since forever, current state peering, last acting [18,17,0]
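The full list of stuck PGs can also be dumped directly with the standard commands, if that is useful:

ceph pg dump_stuck inactive
ceph pg dump_stuck unclean
ceph pg dump_stuck stale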

- A query on the PG shows:
ceph pg 0.22 query
{
   "state": "peering",
../..
    "up": [
       18,
       17,
       0
   ],
          "blocked_by": [
               0,
               1,
               5,
               17
           ],
../..
}


If my understanding of the ceph query is correct, OSDs 1, 5 and 17 have nothing to do with this PG... Where do they come from? Couldn't this be part of the "critical issues with CRUSH" that 0.94.1 is meant to correct?
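
In case it helps to cross-check, this is the kind of comparison I would do between what the cluster reports and what CRUSH itself computes (just a sketch with standard tools, file names are arbitrary):

ceph pg map 0.22                                   # up/acting sets as the monitors see them
ceph osd getcrushmap -o /tmp/crushmap              # extract the compiled CRUSH map
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt    # decompile it for inspection
crushtool -i /tmp/crushmap --test --num-rep 3 --show-mappings | head

If crushtool's mappings disagreed with what the OSDs report, that would point at the CRUSH side; if they agree, maybe the extra OSDs in blocked_by are only left over from past peering intervals.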

Frederic
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



