Hello,
I'll start by explaining what I have done. I was adding some new storage
in an attempt to set up a cache pool according to
https://docs.ceph.com/en/latest/dev/cache-pool/ by doing the following.
1. I upgraded all servers in the cluster to Ceph 15.2.14, which put the
system into recovery for out-of-sync data.
2. I added 2 SSDs as OSDs to the cluster, which immediately caused Ceph
to start rebalancing onto the SSDs.
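Roughly, each OSD was created on its host with something like the
following (reproducing from memory; the device path is just a placeholder):
# ceph-volume lvm create --data /dev/sdX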
3. I added 2 new CRUSH rules, one mapping to SSD storage and one to HDD
storage, roughly as follows.
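The commands were along these lines (the rule names here are
placeholders for the ones I actually used):
# ceph osd crush rule create-replicated replicated_hdd default host hdd
# ceph osd crush rule create-replicated replicated_ssd default host ssd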
4. I assigned my existing VM pool to the HDD storage ruleset.
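That was something like (pool and rule names are placeholders):
# ceph osd pool set <vm-pool> crush_rule replicated_hdd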
5. I added a new pool for the cache tier.
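Roughly (pool name and PG count are placeholders):
# ceph osd pool create <cache-pool> 64
The intention was to then follow the ceph osd tier add / cache-mode /
set-overlay steps from the doc above.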
6. As creation of the new pool seemed to be stuck, and I got an alert
about PG 9.0 being in an unknown state and the main storage pool had
become inaccessible, I decided to reboot my servers in case it was a
small issue that could be resolved by a reboot. After the reboot, more
PGs have gone into the unknown state.
7. I reviewed
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/
and checked ceph pg dump_stuck as follows:
# ceph pg dump_stuck
ok
PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY
7.1e unknown [] -1 [] -1
7.1f unknown [] -1 [] -1
7.1c unknown [] -1 [] -1
7.1d unknown [] -1 [] -1
7.12 unknown [] -1 [] -1
7.13 unknown [] -1 [] -1
7.10 unknown [] -1 [] -1
7.11 unknown [] -1 [] -1
7.16 unknown [] -1 [] -1
7.15 unknown [] -1 [] -1
7.a unknown [] -1 [] -1
7.b unknown [] -1 [] -1
7.8 unknown [] -1 [] -1
7.9 unknown [] -1 [] -1
7.4 unknown [] -1 [] -1
7.19 unknown [] -1 [] -1
7.3 unknown [] -1 [] -1
7.e unknown [] -1 [] -1
7.f unknown [] -1 [] -1
7.c unknown [] -1 [] -1
7.d unknown [] -1 [] -1
7.0 unknown [] -1 [] -1
7.1 unknown [] -1 [] -1
7.1a unknown [] -1 [] -1
7.7 unknown [] -1 [] -1
7.18 unknown [] -1 [] -1
7.5 unknown [] -1 [] -1
8. I tried to use the pg query command as follows:
# ceph pg 7.0 query
Couldn't parse JSON : Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
File "/usr/bin/ceph", line 1285, in <module>
retval = main()
File "/usr/bin/ceph", line 1204, in main
sigdict = parse_json_funcsigs(outbuf.decode('utf-8'), 'cli')
File "/usr/lib/python3.9/site-packages/ceph_argparse.py", line 836, in
parse_json_funcsigs
raise e
File "/usr/lib/python3.9/site-packages/ceph_argparse.py", line 833, in
parse_json_funcsigs
overall = json.loads(s)
File "/usr/lib/python3.9/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.9/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
9. Given these odd errors, I did a lot more research and found I could
try ceph osd force-create-pg, which I have done, but after about an hour
of waiting there is no change in the state. If I check, I get the following:
# ceph osd force-create-pg 7.0 --yes-i-really-mean-it
pg 7.0 already creating
Any help in bringing the cluster back into a healthy state would be
appreciated.
Thanks,
James