Re: Some help needed with ceph deployment

Hi,

 

It seems that all my pgs are stuck in one way or another. I'm not sure what to do from here. I waited a day in the hope that Ceph would sort this out by itself, but nothing happened.

I'm testing on a single Ubuntu Server 13.04 machine with Dumpling 0.67.2. Below is my ceph status.

 

root@cephnode2:/root# ceph -s

  cluster 9087eb7a-abe1-4d38-99dc-cb6b266f0f84

   health HEALTH_WARN 37 pgs degraded; 192 pgs stuck unclean

   monmap e1: 1 mons at {cephnode2=172.16.1.2:6789/0}, election epoch 1, quorum 0 cephnode2

   osdmap e38: 6 osds: 6 up, 6 in

    pgmap v65: 192 pgs: 155 active+remapped, 37 active+degraded; 0 bytes data, 213 MB used, 11172 GB / 11172 GB avail

   mdsmap e1: 0/0/1 up

 

root@cephnode2:/root# ceph osd tree

# id    weight  type name       up/down reweight

-1      10.92   root default

-2      10.92           host cephnode2

0       1.82                    osd.0   up      1

1       1.82                    osd.1   up      1

2       1.82                    osd.2   up      1

3       1.82                    osd.3   up      1

4       1.82                    osd.4   up      1

5       1.82                    osd.5   up      1
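
As the osd tree shows, all six OSDs sit under the single host cephnode2. My guess (and it is only a guess) is that the default CRUSH rules want to place replicas on different hosts, which a cluster with one host can never satisfy, and that this is why the pgs end up remapped or degraded. To check that assumption I could look at the pool replication size and at the decompiled CRUSH rules, roughly like this (the /tmp paths are just examples):

ceph osd pool get data size
ceph osd getcrushmap -o /tmp/crush
crushtool -d /tmp/crush -o /tmp/crush.txt
grep -A8 "rule data" /tmp/crush.txt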

 

root@cephnode2:/root# ceph health detail

HEALTH_WARN 37 pgs degraded; 192 pgs stuck unclean

pg 0.3f is stuck unclean since forever, current state active+remapped, last acting [2,0]

pg 1.3e is stuck unclean since forever, current state active+remapped, last acting [2,0]

pg 2.3d is stuck unclean since forever, current state active+remapped, last acting [2,0]

pg 0.3e is stuck unclean since forever, current state active+remapped, last acting [4,0]

pg 1.3f is stuck unclean since forever, current state active+remapped, last acting [1,0]

pg 2.3c is stuck unclean since forever, current state active+remapped, last acting [4,0]

pg 0.3d is stuck unclean since forever, current state active+degraded, last acting [0]

pg 1.3c is stuck unclean since forever, current state active+degraded, last acting [0]

pg 2.3f is stuck unclean since forever, current state active+remapped, last acting [4,1]

pg 0.3c is stuck unclean since forever, current state active+remapped, last acting [3,1]

pg 1.3d is stuck unclean since forever, current state active+remapped, last acting [4,0]

pg 2.3e is stuck unclean since forever, current state active+remapped, last acting [1,0]

pg 0.3b is stuck unclean since forever, current state active+degraded, last acting [0]

pg 1.3a is stuck unclean since forever, current state active+degraded, last acting [0]

pg 2.39 is stuck unclean since forever, current state active+degraded, last acting [0]

pg 0.3a is stuck unclean since forever, current state active+remapped, last acting [1,0]

pg 1.3b is stuck unclean since forever, current state active+remapped, last acting [3,1]

pg 2.38 is stuck unclean since forever, current state active+remapped, last acting [1,0]

pg 0.39 is stuck unclean since forever, current state active+degraded, last acting [0]

pg 1.38 is stuck unclean since forever, current state active+degraded, last acting [0]

pg 2.3b is stuck unclean since forever, current state active+degraded, last acting [0]

pg 0.38 is stuck unclean since forever, current state active+remapped, last acting [1,0]

pg 1.39 is stuck unclean since forever, current state active+remapped, last acting [1,0]

pg 2.3a is stuck unclean since forever, current state active+remapped, last acting [3,1]

pg 0.37 is stuck unclean since forever, current state active+remapped, last acting [3,2]

[…] and many more.
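
If it helps, I can also post the output of a query on one of the stuck pgs, or a dump of all the stuck ones, e.g. (commands taken from the docs, output omitted here):

ceph pg 0.3f query
ceph pg dump_stuck unclean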

 

I found one entry on the mailing list from someone who had a similar issue and fixed it with the following commands:

 

# ceph osd getcrushmap -o /tmp/crush
# crushtool -i /tmp/crush --enable-unsafe-tunables \
    --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 \
    --set-choose-total-tries 50 -o /tmp/crush.new
# ceph osd setcrushmap -i /tmp/crush.new

 

but I'm not sure what he is trying to do here. Especially --enable-unsafe-tunables seems a little... unsafe.
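
From my (limited) reading of the CRUSH documentation, I wonder whether the real problem is simply that the default rules end with "step chooseleaf firstn 0 type host" while I only have one host. Would editing the decompiled map so it chooses OSDs instead of hosts, roughly like below, be a saner fix than the unsafe tunables? This is only a sketch of what I think is involved, I have not tried it yet:

ceph osd getcrushmap -o /tmp/crush
crushtool -d /tmp/crush -o /tmp/crush.txt
# edit /tmp/crush.txt: in each rule change
#   step chooseleaf firstn 0 type host
# to
#   step chooseleaf firstn 0 type osd
crushtool -c /tmp/crush.txt -o /tmp/crush.new
ceph osd setcrushmap -i /tmp/crush.new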

 

I also read http://eu.ceph.com/docs/wip-3060/ops/manage/failures/osd/#failures-osd-unfound, but it doesn't describe any concrete actions one can take to get back to a HEALTH_OK status.
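
If the single-host layout is indeed the cause, would it be cleaner to just redeploy and put something like the following in ceph.conf under [global] before creating the OSDs, so that the default CRUSH rule spreads replicas across OSDs rather than hosts? Again, only my guess from the docs:

osd crush chooseleaf type = 0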

 

 

Regards,

Johannes




