Can you post 'ceph osd dump --format=json-pretty'?  I'm guessing that the
replication level or crush rules are such that a single host with 6 osds
can't satisfy it.

sage
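A minimal sketch of that check (assuming the stock CRUSH rule, which spreads
replicas across hosts; the /tmp paths are only examples):

    ceph osd dump | grep pool                  # shows each pool's replication size
    ceph osd getcrushmap -o /tmp/crush         # fetch the compiled CRUSH map
    crushtool -d /tmp/crush -o /tmp/crush.txt  # decompile it to readable text
    grep chooseleaf /tmp/crush.txt             # "step chooseleaf firstn 0 type host" means
                                               # every replica must land on a different host,
                                               # which one host with 6 osds can never satisfy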
On Tue, 27 Aug 2013, Johannes Klarenbeek wrote:
> Hi,
>
> It seems that all my pgs are stuck somewhat. I'm not sure what to do from
> here. I waited a day in the hope that ceph would find a way to deal with
> this... but nothing happened.
>
> I'm testing on a single ubuntu server 13.04 with dumpling 0.67.2. Below is
> my ceph status.
>
> root@cephnode2:/root# ceph -s
>   cluster 9087eb7a-abe1-4d38-99dc-cb6b266f0f84
>   health HEALTH_WARN 37 pgs degraded; 192 pgs stuck unclean
>   monmap e1: 1 mons at {cephnode2=172.16.1.2:6789/0}, election epoch 1, quorum 0 cephnode2
>   osdmap e38: 6 osds: 6 up, 6 in
>   pgmap v65: 192 pgs: 155 active+remapped, 37 active+degraded; 0 bytes data, 213 MB used, 11172 GB / 11172 GB avail
>   mdsmap e1: 0/0/1 up
>
> root@cephnode2:/root# ceph osd tree
> # id    weight  type name             up/down  reweight
> -1      10.92   root default
> -2      10.92           host cephnode2
> 0       1.82                  osd.0   up       1
> 1       1.82                  osd.1   up       1
> 2       1.82                  osd.2   up       1
> 3       1.82                  osd.3   up       1
> 4       1.82                  osd.4   up       1
> 5       1.82                  osd.5   up       1
>
> root@cephnode2:/root# ceph health detail
> HEALTH_WARN 37 pgs degraded; 192 pgs stuck unclean
> pg 0.3f is stuck unclean since forever, current state active+remapped, last acting [2,0]
> pg 1.3e is stuck unclean since forever, current state active+remapped, last acting [2,0]
> pg 2.3d is stuck unclean since forever, current state active+remapped, last acting [2,0]
> pg 0.3e is stuck unclean since forever, current state active+remapped, last acting [4,0]
> pg 1.3f is stuck unclean since forever, current state active+remapped, last acting [1,0]
> pg 2.3c is stuck unclean since forever, current state active+remapped, last acting [4,0]
> pg 0.3d is stuck unclean since forever, current state active+degraded, last acting [0]
> pg 1.3c is stuck unclean since forever, current state active+degraded, last acting [0]
> pg 2.3f is stuck unclean since forever, current state active+remapped, last acting [4,1]
> pg 0.3c is stuck unclean since forever, current state active+remapped, last acting [3,1]
> pg 1.3d is stuck unclean since forever, current state active+remapped, last acting [4,0]
> pg 2.3e is stuck unclean since forever, current state active+remapped, last acting [1,0]
> pg 0.3b is stuck unclean since forever, current state active+degraded, last acting [0]
> pg 1.3a is stuck unclean since forever, current state active+degraded, last acting [0]
> pg 2.39 is stuck unclean since forever, current state active+degraded, last acting [0]
> pg 0.3a is stuck unclean since forever, current state active+remapped, last acting [1,0]
> pg 1.3b is stuck unclean since forever, current state active+remapped, last acting [3,1]
> pg 2.38 is stuck unclean since forever, current state active+remapped, last acting [1,0]
> pg 0.39 is stuck unclean since forever, current state active+degraded, last acting [0]
> pg 1.38 is stuck unclean since forever, current state active+degraded, last acting [0]
> pg 2.3b is stuck unclean since forever, current state active+degraded, last acting [0]
> pg 0.38 is stuck unclean since forever, current state active+remapped, last acting [1,0]
> pg 1.39 is stuck unclean since forever, current state active+remapped, last acting [1,0]
> pg 2.3a is stuck unclean since forever, current state active+remapped, last acting [3,1]
> pg 0.37 is stuck unclean since forever, current state active+remapped, last acting [3,2]
> [...] and many more.
>
> I found one entry on the mailing list from someone that had a similar issue
> and he fixed it with the following commands:
>
> #ceph osd getcrushmap -o /tmp/crush
> #crushtool -i /tmp/crush --enable-unsafe-tunables
>    --set-choose-local-tries 0 --set-choose-local-fallback-tries 0
>    --set-choose-total-tries 50 -o /tmp/crush.new
> root@ceph-admin:/etc/ceph# ceph osd setcrushmap -i /tmp/crush.new
>
> but I'm not sure what he is trying to do here. Especially
> --enable-unsafe-tunables seems a little ... unsafe.
>
> I also read this link:
> http://eu.ceph.com/docs/wip-3060/ops/manage/failures/osd/#failures-osd-unfound
> But it doesn't detail any actions that one can take in order to get back
> to a HEALTH_OK status.
>
> Regards,
> Johannes
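If the decompiled rule does turn out to require distinct hosts, one common
workaround for a single-node test cluster (an assumption here, not something
confirmed in this thread) is to rewrite the rule to choose osds instead of
hosts and inject the edited map, rather than touching the tunables:

    ceph osd getcrushmap -o /tmp/crush
    crushtool -d /tmp/crush -o /tmp/crush.txt
    # in /tmp/crush.txt, change each rule line reading
    #     step chooseleaf firstn 0 type host
    # to
    #     step chooseleaf firstn 0 type osd
    crushtool -c /tmp/crush.txt -o /tmp/crush.new
    ceph osd setcrushmap -i /tmp/crush.new

For a fresh single-node deployment, setting 'osd crush chooseleaf type = 0'
in ceph.conf before creating the cluster is meant to produce a map like this
from the start.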