Hi, Cephers!
I'm currently testing a double-failure scenario on a Ceph cluster, but I've found that some PGs remain stuck in the stale state forever.
Reproduction steps:
0. ceph version : jewel 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
1. Pool create : exp-volumes (size = 2, min_size = 1)
2. rbd create : testvol01
3. rbd map and mkfs.xfs
4. mount and create a file
5. list the RADOS objects
6. check the OSD map for each object
# ceph osd map exp-volumes rbd_data.4a41f238e1f29.000000000000017a
osdmap e199 pool 'exp-volumes' (2) object 'rbd_data.4a41f238e1f29.000000000000017a' -> pg 2.3f04d6e2 (2.62) -> up ([2,6], p2) acting ([2,6], p2)
7. stop the primary (osd.2) and secondary (osd.6) of the above object at the same time
8. check ceph status
health HEALTH_ERR
16 pgs are stuck inactive for more than 300 seconds
16 pgs stale
16 pgs stuck stale
monmap e11: 3 mons at {10.105.176.85=10.105.176.85:6789/0,10.110.248.154=10.110.248.154:6789/0,10.110.249.153=10.110.249.153:6789/0 }
election epoch 84, quorum 0,1,2 10.105.176.85,10.110.248.154,10.110.249.153
osdmap e248: 6 osds: 4 up, 4 in; 16 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v112095: 128 pgs, 1 pools, 14659 kB data, 17 objects
165 MB used, 159 GB / 160 GB avail
112 active+clean
16 stale+active+clean
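For reference, steps 1-6 above can be scripted roughly like this (a sketch only; the image size, rbd device path, and mount point are my assumptions, not taken from the original run):

```shell
# Sketch of steps 1-6 (pool/image names as above; image size,
# device path, and mount point are assumed values)
ceph osd pool create exp-volumes 128 128
ceph osd pool set exp-volumes size 2
ceph osd pool set exp-volumes min_size 1
rbd create exp-volumes/testvol01 --size 1024     # size in MB on jewel
rbd map exp-volumes/testvol01                    # typically appears as /dev/rbd0
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /mnt/testvol01
echo "hello" > /mnt/testvol01/testfile
rados ls -p exp-volumes                                            # step 5
ceph osd map exp-volumes rbd_data.4a41f238e1f29.000000000000017a   # step 6
```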
# ceph health detail
HEALTH_ERR 16 pgs are stuck inactive for more than 300 seconds; 16 pgs stale; 16 pgs stuck stale
pg 2.67 is stuck stale for 689.171742, current state stale+active+clean, last acting [2,6]
pg 2.5a is stuck stale for 689.171748, current state stale+active+clean, last acting [6,2]
pg 2.52 is stuck stale for 689.171753, current state stale+active+clean, last acting [2,6]
pg 2.4d is stuck stale for 689.171757, current state stale+active+clean, last acting [2,6]
pg 2.56 is stuck stale for 689.171755, current state stale+active+clean, last acting [6,2]
pg 2.d is stuck stale for 689.171811, current state stale+active+clean, last acting [6,2]
pg 2.79 is stuck stale for 689.171808, current state stale+active+clean, last acting [2,6]
pg 2.1f is stuck stale for 689.171782, current state stale+active+clean, last acting [6,2]
pg 2.76 is stuck stale for 689.171809, current state stale+active+clean, last acting [6,2]
pg 2.17 is stuck stale for 689.171794, current state stale+active+clean, last acting [6,2]
pg 2.63 is stuck stale for 689.171794, current state stale+active+clean, last acting [2,6]
pg 2.77 is stuck stale for 689.171816, current state stale+active+clean, last acting [2,6]
pg 2.1b is stuck stale for 689.171793, current state stale+active+clean, last acting [6,2]
pg 2.62 is stuck stale for 689.171765, current state stale+active+clean, last acting [2,6]
pg 2.30 is stuck stale for 689.171799, current state stale+active+clean, last acting [2,6]
pg 2.19 is stuck stale for 689.171798, current state stale+active+clean, last acting [6,2]
# ceph pg dump_stuck stale
ok
pg_stat state up up_primary acting acting_primary
2.67 stale+active+clean [2,6] 2 [2,6] 2
2.5a stale+active+clean [6,2] 6 [6,2] 6
2.52 stale+active+clean [2,6] 2 [2,6] 2
2.4d stale+active+clean [2,6] 2 [2,6] 2
2.56 stale+active+clean [6,2] 6 [6,2] 6
2.d stale+active+clean [6,2] 6 [6,2] 6
2.79 stale+active+clean [2,6] 2 [2,6] 2
2.1f stale+active+clean [6,2] 6 [6,2] 6
2.76 stale+active+clean [6,2] 6 [6,2] 6
2.17 stale+active+clean [6,2] 6 [6,2] 6
2.63 stale+active+clean [2,6] 2 [2,6] 2
2.77 stale+active+clean [2,6] 2 [2,6] 2
2.1b stale+active+clean [6,2] 6 [6,2] 6
2.62 stale+active+clean [2,6] 2 [2,6] 2
2.30 stale+active+clean [2,6] 2 [2,6] 2
2.19 stale+active+clean [6,2] 6 [6,2] 6
# ceph pg 2.62 query
Error ENOENT: i don't have pgid 2.62
# rados ls -p exp-volumes
rbd_data.4a41f238e1f29.000000000000003f
^C  <-- the command hangs here
I understand that this is the expected result, because the PGs above have lost both their primary and secondary OSDs. But since this situation can occur in real operation, I want to know how to recover the Ceph cluster and the RBD images.
First, I want to know how to bring the cluster's state back to clean.
I have read the documentation and tried to solve this, but nothing has helped, including the commands below.
- ceph pg force_create_pg 2.6
- ceph osd lost 2 --yes-i-really-mean-it
- ceph osd lost 6 --yes-i-really-mean-it
- ceph osd crush rm osd.2
- ceph osd crush rm osd.6
- ceph osd rm osd.2
- ceph osd rm osd.6
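For completeness, the full dead-OSD removal sequence as I understand it on jewel looks like the following (a sketch, not a known-good procedure; `ceph auth del` is the one step missing from my list above, and I'm not sure whether `force_create_pg` will then un-stick the PGs):

```shell
# Standard dead-OSD removal on jewel (shown for osd.2; repeat for osd.6).
ceph osd lost 2 --yes-i-really-mean-it
ceph osd crush rm osd.2
ceph auth del osd.2        # remove the OSD's cephx key (not in my list above)
ceph osd rm osd.2
# After both OSDs are removed, try recreating one stale PG
# (any data in it is gone for good):
ceph pg force_create_pg 2.62
```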
Is there any command to force-delete these PGs or otherwise make the cluster clean again?
Thank you in advance.
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com