stuck unclean/stuck inactive

Derek Yarnell <derek@xxxxxxxxxxxxxx> · Thu, 30 Jan 2014 16:17:49 -0500

Hi,

So I am trying to remove OSDs from one of our 6 ceph OSDs. This is a
brand new cluster and no data is on it yet.  I was following the manual
procedure[1] with the following script.  I removed OSDs 0-3, but ceph
is not fully recovering.

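#!/bin/bash
# Remove a single OSD from the cluster; ${1} is the numeric OSD id.
ceph osd out ${1}                 # mark the OSD out
/etc/init.d/ceph stop osd.${1}    # stop the OSD daemon
ceph osd crush remove osd.${1}    # remove it from the CRUSH map
ceph auth del osd.${1}            # delete its authentication key
ceph osd rm ${1}                  # remove the OSD from the OSD map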

# ceph -v
ceph version 0.72.2-2-g5169d4e (5169d4e957791533e6c1c1aa83c15486d0e7afea)

# ceph status
    cluster 7bdc37df-978c-4ddd-a3d4-97a06fc2b016
     health HEALTH_WARN 8 pgs stuck inactive; 8 pgs stuck unclean
     monmap e2: 3 mons at {objmon00=192.168.22.30:6789/0,objmon01=192.168.22.31:6789/0,objmon02=192.168.22.32:6789/0}, election epoch 18, quorum 0,1,2 objmon00,objmon01,objmon02
     osdmap e5567: 140 osds: 128 up, 128 in
      pgmap v151073: 7612 pgs, 18 pools, 90085 kB data, 3244 objects
            15978 MB used, 465 TB / 465 TB avail
                   8 inactive
                7604 active+clean

# ceph health detail
HEALTH_WARN 8 pgs stuck inactive; 8 pgs stuck unclean
pg 19.10a is stuck inactive since forever, current state inactive, last acting [7,99,20]
pg 19.882 is stuck inactive since forever, current state inactive, last acting [9,46,64]
pg 19.124a is stuck inactive since forever, current state inactive, last acting [82,108,14]
pg 19.10db is stuck inactive for 1150.820893, current state inactive, last acting [63,54,72]
pg 19.7a2 is stuck inactive for 1150.868763, current state inactive, last acting [18,75,122]
pg 19.30a is stuck inactive for 1150.713369, current state inactive, last acting [107,75,16]
pg 19.142a is stuck inactive for 1229.702841, current state inactive, last acting [119,16,74]
pg 19.758 is stuck inactive for 1230.207810, current state inactive, last acting [23,136,81]
pg 19.10a is stuck unclean since forever, current state inactive, last acting [7,99,20]
pg 19.882 is stuck unclean since forever, current state inactive, last acting [9,46,64]
pg 19.124a is stuck unclean since forever, current state inactive, last acting [82,108,14]
pg 19.10db is stuck unclean for 1150.821256, current state inactive, last acting [63,54,72]
pg 19.7a2 is stuck unclean for 1150.869125, current state inactive, last acting [18,75,122]
pg 19.30a is stuck unclean for 1150.713731, current state inactive, last acting [107,75,16]
pg 19.142a is stuck unclean for 1229.703203, current state inactive, last acting [119,16,74]
pg 19.758 is stuck unclean for 1230.208172, current state inactive, last acting [23,136,81]
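
Since every PG shows up twice in that output (once as inactive, once as unclean), a small filter can reduce it to the unique PG IDs. This is only a sketch: the function name stuck_pgs is my own, and it assumes the "pg <id> is stuck ..." line layout shown above.

```shell
#!/bin/bash
# Hypothetical helper: list the unique IDs of PGs reported as stuck.
# Reads `ceph health detail` output on stdin, so it also works on saved output.
stuck_pgs() {
    # Matching lines look like: "pg 19.10a is stuck inactive since forever, ..."
    # Field 2 is the PG ID; sort -u collapses the inactive/unclean duplicates.
    awk '$1 == "pg" && $3 == "is" && $4 == "stuck" { print $2 }' | sort -u
}
```

It would be used as: ceph health detail | stuck_pgs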

# ceph pg 19.10a query
{ "state": "inactive",
  "epoch": 5567,
  "up": [
        7,
        99,
        20],
  "acting": [
        7,
        99,
        20],
  "info": { "pgid": "19.10a",
  ...
  "recovery_state": [
        { "name": "Started\/Primary\/Peering\/WaitActingChange",
          "enter_time": "2014-01-30 15:53:10.229039",
          "comment": "waiting for pg acting set to change"},
        { "name": "Started",
          "enter_time": "2014-01-30 15:53:10.207856"}]}
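
To compare the recovery state across all of the stuck PGs, the query output can be reduced to just the state names. This is a rough text-based sketch, not a proper JSON parser: the function name recovery_states is made up, and it assumes the pretty-printed layout shown above, with one "name" key per line after "recovery_state" and slashes escaped as \/.

```shell
#!/bin/bash
# Hypothetical helper: print the recovery_state names from `ceph pg <pgid> query`.
# Reads the query output on stdin.
recovery_states() {
    # From the "recovery_state" line to the end of input, print the value of
    # each "name" key, then turn the escaped "\/" back into a plain "/".
    sed -n '/"recovery_state"/,$ s/.*"name": "\([^"]*\)".*/\1/p' | sed 's#\\/#/#g'
}
```

It would be used as: ceph pg 19.10a query | recovery_states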

What does "waiting for pg acting set to change" mean?  It has only a
single worthwhile hit on Google, for a bug that is a year old.  I have
no data at risk on this cluster.

[1] - http://ceph.com/docs/master/rados/operations/add-or-rm-osds/

Thanks,
derek

-- 
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies