Still CRUSH problems with 0.94.1?

Hi all,

might there be a problem with the CRUSH function during a 'from scratch' installation of 0.94.1-0?

I have tested this many times, with both ceph-deploy-1.5.22-0 and ceph-deploy-1.5.23-0, on RHEL 7.

Each time, the new cluster ends up in a weird state I have never seen with my previously installed versions (0.94, 0.87.1):

- I have seen behaviour that is perhaps linked to ceph-deploy-1.5.23-0: one or more monitors unable to form the cluster (with respawning 'python /usr/sbin/ceph-create-keys' messages). But I think that is a separate part of the issue.
- The main issue is visible as PG health warnings as soon as the cluster is formed enough to answer a 'ceph -s'.
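For reference, the deployment follows the usual ceph-deploy workflow; the sequence below is only a sketch of the kind of commands used (only 'helga' is a real hostname, the OSD host and disk names are placeholders):

ceph-deploy new helga
ceph-deploy install helga osdhost1 osdhost2 osdhost3
ceph-deploy mon create-initial
ceph-deploy osd create osdhost1:sdb osdhost1:sdc    # and so on for the 60 disks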

- Here is a freshly installed, almost empty cluster with a single monitor:

ROOT > ceph -s
   cluster e581ab43-d0f5-4ea8-811f-94c8df16d044
    health HEALTH_WARN
           2 pgs degraded
           14 pgs peering
           4 pgs stale
           2 pgs stuck degraded
           25 pgs stuck inactive
           4 pgs stuck stale
           27 pgs stuck unclean
           2 pgs stuck undersized
           2 pgs undersized
           too few PGs per OSD (3 < min 30)
    monmap e1: 1 mons at {helga=10.10.10.64:6789/0}
           election epoch 2, quorum 0 helga
    osdmap e398: 60 osds: 60 up, 60 in; 2 remapped pgs
     pgmap v1553: 64 pgs, 1 pools, 0 bytes data, 0 objects
           2829 MB used, 218 TB / 218 TB avail
                 37 active+clean
                 12 peering
                 11 activating
                  2 stale+active+undersized+degraded
                  2 stale+remapped+peering
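
As an aside, the 'too few PGs per OSD (3 < min 30)' line is simple arithmetic: 64 PGs * 3 replicas / 60 OSDs is about 3, below the default mon_pg_warn_min_per_osd of 30. Assuming the single pool is the default 'rbd' pool with size 3, raising pg_num/pgp_num would normally clear that particular warning, for example:

ceph osd pool set rbd pg_num 2048
ceph osd pool set rbd pgp_num 2048

But that is unrelated to the stuck/peering PGs, which are the real problem here.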

Over time, the number of problem PGs keeps growing, and it literally explodes if we put objects on the cluster.

- A 'ceph health detail' shows, for example, entries like this one:
pg 0.22 is stuck inactive since forever, current state peering, last acting [18,17,0]
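The full list of stuck PGs can also be dumped directly with the standard commands, if that is useful:

ceph pg dump_stuck inactive
ceph pg dump_stuck unclean
ceph pg dump_stuck stale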

- A query on the PG shows:
ceph pg 0.22 query
{
   "state": "peering",
../..
    "up": [
       18,
       17,
       0
   ],
          "blocked_by": [
               0,
               1,
               5,
               17
           ],
../..
}


If my understanding of the ceph query is correct, OSDs 1, 5 and 17 have nothing to do with this PG... Where do they come from? Couldn't this be part of the "critical issues with CRUSH" that 0.94.1 is meant to correct?
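
In case it helps to cross-check, this is the kind of comparison I would do between what the cluster reports and what CRUSH itself computes (just a sketch with standard tools, file names are arbitrary):

ceph pg map 0.22                                   # up/acting sets as the monitors see them
ceph osd getcrushmap -o /tmp/crushmap              # extract the compiled CRUSH map
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt    # decompile it for inspection
crushtool -i /tmp/crushmap --test --num-rep 3 --show-mappings | head

If crushtool's mappings disagreed with what the OSDs report, that would point at the CRUSH side; if they agree, maybe the extra OSDs in blocked_by are only left over from past peering intervals.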

Frederic
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



