We've had a serious problem with our production cluster & we need urgent help. Thanks in advance.

**Is this a bug report or feature request?**
* Bug Report

**Deviation from expected behavior:**
After updating Kubernetes and rebooting our servers, our OSDs stopped working. We tried our best to restore them, trying every method we could find and checking all the relevant issues, documentation, and configuration.
We restored our Kubernetes snapshot and the etcd backup.
Currently, the PG status shows as `unknown`.

**Expected behavior:**
OSDs should have a normal state after an upgrade.

**How to reproduce it (minimal and precise):**
* Update Kubernetes from `v1.17` to `v1.18` and change all certificates; after this, Rook and Ceph become unstable.
* Restore etcd to the last stable `v1.17` backup using `rke etcd snapshot-restore ...` and run `rke up` again (a hedged command sketch follows below).
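A minimal sketch of that rollback, assuming RKE's default `cluster.yml` and a placeholder snapshot name (neither taken from our cluster):

```
# Restore etcd from the last known-good v1.17 snapshot, then reconcile the cluster.
rke etcd snapshot-restore --config cluster.yml --name <last-stable-v1.17-snapshot>
rke up --config cluster.yml
```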
**Best solution we applied to restore the old Ceph cluster**

* Start a new, clean Rook Ceph cluster with the old CephCluster, CephBlockPool, CephFilesystem, CephNFS, and CephObjectStore resources.
* Shut the new cluster down when it has been created successfully.
* Replace the ceph-mon data with that of the old cluster (a hedged sketch of these mon-store steps follows this list).
* Replace the fsid in `secrets/rook-ceph-mon` with that of the old cluster.
* Fix monmap in ceph-mon db.
* Fix ceph mon auth key.
* Disable auth.
* Start the new cluster, watch it resurrect.
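A minimal sketch of the mon-store steps above (replacing the mon data, fixing the fsid, and rewriting the monmap), adapted from the Rook disaster-recovery guide referenced below. The mon name `a`, the `dataDirHostPath` of `/var/lib/rook`, the backup location, and the stale mon IDs `b`/`c` are assumptions, not values from our cluster:

```
# Copy the old cluster's mon data over the new mon's store (on the mon host).
rm -rf /var/lib/rook/mon-a/data
cp -a /var/lib/rook.backup/mon-a/data /var/lib/rook/mon-a/data

# Point the rook-ceph-mon secret at the old cluster's fsid.
kubectl -n rook-ceph patch secret rook-ceph-mon \
  -p '{"stringData": {"fsid": "<old-cluster-fsid>"}}'

# Rewrite the monmap inside the mon store so only mon.a remains
# (run where the ceph binaries and the mon data directory are available).
ceph-mon --extract-monmap /tmp/monmap --mon-data /var/lib/rook/mon-a/data
monmaptool /tmp/monmap --rm b   # drop mons that no longer exist (IDs assumed)
monmaptool /tmp/monmap --rm c
ceph-mon --inject-monmap /tmp/monmap --mon-data /var/lib/rook/mon-a/data
```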
Reference: https://rook.github.io/docs/rook/v1.6/ceph-disaster-recovery.html

**Current state after all recovery solutions**
The Rook operator, manager, monitors, OSD pods, and all agents are ready, without fundamental errors.
All OSDs are `in` but `down`.
The pools are found, but all PGs are `unknown`!
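Besides the `ceph -s` output below, a hedged sketch of standard Ceph commands that confirm this OSD/PG state (not output captured from our cluster):

```
ceph osd tree        # per-host OSD up/down and in/out status
ceph osd df          # per-OSD utilization
ceph pg stat         # summary of PG states
ceph health detail   # details on inactive PGs and slow ops
```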
`ceph -s`
```
  cluster:
    id:     .....
    health: HEALTH_WARN
            nodown,noout,norebalance flag(s) set
            Reduced data availability: 64 pgs inactive
            33 slow ops, oldest one blocked for 79746 sec, mon.a has slow ops

  services:
    mon: 1 daemons, quorum a (age 22h)
    mgr: a(active, since 22h)
    osd: 33 osds: 0 up, 33 in (since 22h)
         flags nodown,noout,norebalance

  data:
    pools:   2 pools, 64 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             64 unknown
```

**Environment**:
* OS: RancherOS v1.5.8
* Kernel: v4.14.138-rancher
* Cloud provider: bare-metal, installed with RKE
* Kubernetes: v1.17
* Ceph: v14.2.9, updated to v16.2.4 during the recovery process
* Rook: v1.2.7, updated to v1.6.7 during the recovery process

GitHub issue link: https://github.com/rook/rook/issues/8329