Re: stuck unclean/stuck inactive

On Thu, Jan 30, 2014 at 1:17 PM, Derek Yarnell <derek@xxxxxxxxxxxxxx> wrote:
> Hi,
>
> So I am trying to remove OSDs from one of our 6 ceph OSD servers. This is
> a brand new cluster with no data on it yet. I was following the manual
> procedure[1] with the script below. I removed OSDs 0-3, but I am seeing
> ceph not fully recovering.
>
> #!/bin/bash
> ceph osd out ${1}
> /etc/init.d/ceph stop osd.${1}
> ceph osd crush remove osd.${1}
> ceph auth del osd.${1}
> ceph osd rm ${1}
>
> # ceph -v
> ceph version 0.72.2-2-g5169d4e (5169d4e957791533e6c1c1aa83c15486d0e7afea)
>
> # ceph status
>     cluster 7bdc37df-978c-4ddd-a3d4-97a06fc2b016
>      health HEALTH_WARN 8 pgs stuck inactive; 8 pgs stuck unclean
>      monmap e2: 3 mons at
> {objmon00=192.168.22.30:6789/0,objmon01=192.168.22.31:6789/0,objmon02=192.168.22.32:6789/0},
> election epoch 18, quorum 0,1,2 objmon00,objmon01,objmon02
>      osdmap e5567: 140 osds: 128 up, 128 in
>       pgmap v151073: 7612 pgs, 18 pools, 90085 kB data, 3244 objects
>             15978 MB used, 465 TB / 465 TB avail
>                    8 inactive
>                 7604 active+clean
>
> # ceph health detail
> HEALTH_WARN 8 pgs stuck inactive; 8 pgs stuck unclean
> pg 19.10a is stuck inactive since forever, current state inactive, last
> acting [7,99,20]
> pg 19.882 is stuck inactive since forever, current state inactive, last
> acting [9,46,64]
> pg 19.124a is stuck inactive since forever, current state inactive, last
> acting [82,108,14]
> pg 19.10db is stuck inactive for 1150.820893, current state inactive,
> last acting [63,54,72]
> pg 19.7a2 is stuck inactive for 1150.868763, current state inactive,
> last acting [18,75,122]
> pg 19.30a is stuck inactive for 1150.713369, current state inactive,
> last acting [107,75,16]
> pg 19.142a is stuck inactive for 1229.702841, current state inactive,
> last acting [119,16,74]
> pg 19.758 is stuck inactive for 1230.207810, current state inactive,
> last acting [23,136,81]
> pg 19.10a is stuck unclean since forever, current state inactive, last
> acting [7,99,20]
> pg 19.882 is stuck unclean since forever, current state inactive, last
> acting [9,46,64]
> pg 19.124a is stuck unclean since forever, current state inactive, last
> acting [82,108,14]
> pg 19.10db is stuck unclean for 1150.821256, current state inactive,
> last acting [63,54,72]
> pg 19.7a2 is stuck unclean for 1150.869125, current state inactive, last
> acting [18,75,122]
> pg 19.30a is stuck unclean for 1150.713731, current state inactive, last
> acting [107,75,16]
> pg 19.142a is stuck unclean for 1229.703203, current state inactive,
> last acting [119,16,74]
> pg 19.758 is stuck unclean for 1230.208172, current state inactive, last
> acting [23,136,81]
>
> # ceph pg 19.10a query
> { "state": "inactive",
>   "epoch": 5567,
>   "up": [
>         7,
>         99,
>         20],
>   "acting": [
>         7,
>         99,
>         20],
>   "info": { "pgid": "19.10a",
>   ...
>   "recovery_state": [
>         { "name": "Started\/Primary\/Peering\/WaitActingChange",
>           "enter_time": "2014-01-30 15:53:10.229039",
>           "comment": "waiting for pg acting set to change"},
>         { "name": "Started",
>           "enter_time": "2014-01-30 15:53:10.207856"}]}
>
> What does "waiting for pg acting set to change" mean? It only has a single
> worthwhile hit on Google, and that is for a bug about a year old. I have
> no data at risk on this cluster.

That state means that the new primary was unable to get the data it
needs to make any decisions about the PG, because nobody was providing
it. It sounds like you've got a cluster with 3x replication and you
just took down 3 OSDs at once without letting them export their data
first, and so naturally some data became inaccessible. ;)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
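
For reference, a minimal sketch of a more gradual removal loop along the lines
Greg describes: mark the OSD out, wait for its data to migrate and the cluster
to return to HEALTH_OK, and only then stop the daemon and remove it. It assumes
the same init-script layout as Derek's script; adjust for your own deployment.

#!/bin/bash
# Remove a single OSD only after its data has migrated off.
# Pass the numeric OSD id as the first argument.
set -e
ID=${1}

# Mark the OSD out so CRUSH starts re-mapping its PGs to other OSDs.
ceph osd out ${ID}

# Wait for recovery to finish before touching the daemon.
until ceph health | grep -q HEALTH_OK; do
    echo "waiting for recovery after marking osd.${ID} out..."
    sleep 10
done

# Now it is safe to stop the daemon and remove the OSD from the cluster.
/etc/init.d/ceph stop osd.${ID}
ceph osd crush remove osd.${ID}
ceph auth del osd.${ID}
ceph osd rm ${ID}

Removing OSDs one at a time this way keeps the other replicas of every PG
available, so the cluster never loses the ability to peer.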
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



