On Tue, Nov 5, 2013 at 3:02 PM, Dominik Mostowiec
<dominikmostowiec@xxxxxxxxx> wrote:
> After taking the OSDs of one server (11 osd) out with "ceph osd out X",
> ceph started the data migration process. It stopped at:
> 32424 pgs: 30635 active+clean, 191 active+remapped, 1596
> active+degraded, 2 active+clean+scrubbing;
> degraded (1.718%)
>
> All OSDs with reweight==1 are UP.
>
> ceph -v
> ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33)

Hi,

Below I'm pasting some more information on this issue. The cluster
status hasn't changed for more than 24 hours:

# ceph health
HEALTH_WARN 1596 pgs degraded; 1787 pgs stuck unclean; recovery
2142704/123949567 degraded (1.729%)

I parsed the output of ceph pg dump and can see three types of pg
states there:

1. *Two* OSDs up and *two* acting:

   16.11     [42, 92]         [42, 92]         active+degraded
   17.10     [42, 92]         [42, 92]         active+degraded

2. *Three* OSDs up and *three* acting:

   12.d      [114, 138, 5]    [114, 138, 5]    active+clean
   15.e      [13, 130, 142]   [13, 130, 142]   active+clean

3. *Two* OSDs up and *three* acting:

   16.2256   [63, 109]        [63, 109, 40]    active+remapped
   16.220b   [129, 22]        [129, 22, 47]    active+remapped

A part of the crush map:

rack rack1 {
        id -5           # do not change unnecessarily
        # weight 60.000
        alg straw
        hash 0  # rjenkins1
        item storinodfs1 weight 12.000
        item storinodfs11 weight 12.000
        item storinodfs6 weight 12.000
        item storinodfs9 weight 12.000
        item storinodfs8 weight 12.000
}
rack rack2 {
        id -7           # do not change unnecessarily
        # weight 48.000
        alg straw
        hash 0  # rjenkins1
        item storinodfs3 weight 12.000
        item storinodfs4 weight 12.000
        item storinodfs2 weight 12.000
        item storinodfs10 weight 12.000
}
rack rack3 {
        id -10          # do not change unnecessarily
        # weight 36.000
        alg straw
        hash 0  # rjenkins1
        item storinodfs5 weight 12.000   <=== all OSDs on this node have been marked out with ceph osd out
        item storinodfs7 weight 12.000
        item storinodfs12 weight 12.000
}

rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}

The command ceph osd out has been invoked on all OSDs on storinodfs5,
and ceph osd tree lists them all as down:

-11     12              host storinodfs5
48      1                       osd.48  down    0
49      1                       osd.49  down    0
50      1                       osd.50  down    0
51      1                       osd.51  down    0
52      1                       osd.52  down    0
53      1                       osd.53  down    0
54      1                       osd.54  down    0
55      1                       osd.55  down    0
56      1                       osd.56  down    0
57      1                       osd.57  down    0
58      1                       osd.58  down    0
59      1                       osd.59  down    0

I wonder whether the current cluster state might be related to the fact
that the crush map still keeps storinodfs5 with weight 12. We're unable
to make ceph recover from this faulty state. Any hints would be very
much appreciated.

--
Regards,
Bohdan Sydor
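
P.S. For reference, this is roughly how the OSDs on storinodfs5 were
taken out (a sketch; the exact invocation may have differed, but only
"ceph osd out" was used, and nothing was removed from or reweighted in
the crush map):

    # Mark every OSD on storinodfs5 (osd.48 through osd.59) out.
    # "ceph osd out" only sets the OSD reweight to 0 in the osdmap;
    # the crush weight of host storinodfs5 stays at 12.000.
    for id in $(seq 48 59); do
        ceph osd out "$id"
    done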