Hi, Cephers!
I'm currently testing a double-failure scenario on a Ceph cluster, but I've found that some PGs remain stuck in the stale state forever.
Reproduction steps:
0. ceph version : jewel 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
1. Pool create : exp-volumes (size = 2, min_size = 1)
2. rbd create : testvol01
3. rbd map and mkfs.xfs
4. mount and create a file
5. list the RADOS objects
6. check the OSD map for each object
# ceph osd map exp-volumes rbd_data.4a41f238e1f29.000000000000017a
osdmap e199 pool 'exp-volumes' (2) object 'rbd_data.4a41f238e1f29.000000000000017a' -> pg 2.3f04d6e2 (2.62) -> up ([2,6], p2) acting ([2,6], p2)
7. stop the primary (osd.2) and secondary (osd.6) of the above object at the same time
8. check ceph status
health HEALTH_ERR
16 pgs are stuck inactive for more than 300 seconds
16 pgs stale
16 pgs stuck stale
monmap e11: 3 mons at {10.105.176.85=10.105.176.85:6789/0,10.110.248.154=10.110.248.154:6789/0,10.110.249.153=10.110.249.153:6789/0 }
election epoch 84, quorum 0,1,2 10.105.176.85,10.110.248.154,10.110.249.153
osdmap e248: 6 osds: 4 up, 4 in; 16 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v112095: 128 pgs, 1 pools, 14659 kB data, 17 objects
165 MB used, 159 GB / 160 GB avail
112 active+clean
16 stale+active+clean
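For reference, steps 1-6 above can be scripted roughly like this (a sketch only; the image size, rbd device path, and mount point are my assumptions, not taken from the original run):

```shell
# Sketch of steps 1-6 (pool/image names as above; image size,
# device path, and mount point are assumed values)
ceph osd pool create exp-volumes 128 128
ceph osd pool set exp-volumes size 2
ceph osd pool set exp-volumes min_size 1
rbd create exp-volumes/testvol01 --size 1024     # size in MB on jewel
rbd map exp-volumes/testvol01                    # typically appears as /dev/rbd0
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /mnt/testvol01
echo "hello" > /mnt/testvol01/testfile
rados ls -p exp-volumes                                            # step 5
ceph osd map exp-volumes rbd_data.4a41f238e1f29.000000000000017a   # step 6
```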
# ceph health detail
HEALTH_ERR 16 pgs are stuck inactive for more than 300 seconds; 16 pgs stale; 16 pgs stuck stale
pg 2.67 is stuck stale for 689.171742, current state stale+active+clean, last acting [2,6]
pg 2.5a is stuck stale for 689.171748, current state stale+active+clean, last acting [6,2]
pg 2.52 is stuck stale for 689.171753, current state stale+active+clean, last acting [2,6]
pg 2.4d is stuck stale for 689.171757, current state stale+active+clean, last acting [2,6]
pg 2.56 is stuck stale for 689.171755, current state stale+active+clean, last acting [6,2]
pg 2.d is stuck stale for 689.171811, current state stale+active+clean, last acting [6,2]
pg 2.79 is stuck stale for 689.171808, current state stale+active+clean, last acting [2,6]
pg 2.1f is stuck stale for 689.171782, current state stale+active+clean, last acting [6,2]
pg 2.76 is stuck stale for 689.171809, current state stale+active+clean, last acting [6,2]
pg 2.17 is stuck stale for 689.171794, current state stale+active+clean, last acting [6,2]
pg 2.63 is stuck stale for 689.171794, current state stale+active+clean, last acting [2,6]
pg 2.77 is stuck stale for 689.171816, current state stale+active+clean, last acting [2,6]
pg 2.1b is stuck stale for 689.171793, current state stale+active+clean, last acting [6,2]
pg 2.62 is stuck stale for 689.171765, current state stale+active+clean, last acting [2,6]
pg 2.30 is stuck stale for 689.171799, current state stale+active+clean, last acting [2,6]
pg 2.19 is stuck stale for 689.171798, current state stale+active+clean, last acting [6,2]
# ceph pg dump_stuck stale
ok
pg_stat state up up_primary acting acting_primary
2.67 stale+active+clean [2,6] 2 [2,6] 2
2.5a stale+active+clean [6,2] 6 [6,2] 6
2.52 stale+active+clean [2,6] 2 [2,6] 2
2.4d stale+active+clean [2,6] 2 [2,6] 2
2.56 stale+active+clean [6,2] 6 [6,2] 6
2.d stale+active+clean [6,2] 6 [6,2] 6
2.79 stale+active+clean [2,6] 2 [2,6] 2
2.1f stale+active+clean [6,2] 6 [6,2] 6
2.76 stale+active+clean [6,2] 6 [6,2] 6
2.17 stale+active+clean [6,2] 6 [6,2] 6
2.63 stale+active+clean [2,6] 2 [2,6] 2
2.77 stale+active+clean [2,6] 2 [2,6] 2
2.1b stale+active+clean [6,2] 6 [6,2] 6
2.62 stale+active+clean [2,6] 2 [2,6] 2
2.30 stale+active+clean [2,6] 2 [2,6] 2
2.19 stale+active+clean [6,2] 6 [6,2] 6
# ceph pg 2.62 query
Error ENOENT: i don't have pgid 2.62
# rados ls -p exp-volumes
rbd_data.4a41f238e1f29.000000000000003f
^C  <-- the command hangs here
I understand that this is the expected result, because the PGs above have lost both their primary and secondary OSDs. But since this situation can occur in real operation, I want to know how to recover the Ceph cluster and the RBD images.
First, I want to know how to bring the cluster's state back to clean.
I have read the documentation and tried to solve this, but nothing has helped, including the commands below.
- ceph pg force_create_pg 2.6
- ceph osd lost 2 --yes-i-really-mean-it
- ceph osd lost 6 --yes-i-really-mean-it
- ceph osd crush rm osd.2
- ceph osd crush rm osd.6
- ceph osd rm osd.2
- ceph osd rm osd.6
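For completeness, the full dead-OSD removal sequence as I understand it on jewel looks like the following (a sketch, not a known-good procedure; `ceph auth del` is the one step missing from my list above, and I'm not sure whether `force_create_pg` will then un-stick the PGs):

```shell
# Standard dead-OSD removal on jewel (shown for osd.2; repeat for osd.6).
ceph osd lost 2 --yes-i-really-mean-it
ceph osd crush rm osd.2
ceph auth del osd.2        # remove the OSD's cephx key (not in my list above)
ceph osd rm osd.2
# After both OSDs are removed, try recreating one stale PG
# (any data in it is gone for good):
ceph pg force_create_pg 2.62
```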
Is there any command to force-delete these PGs or otherwise make the cluster clean again?
Thank you in advance.
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com