Re: Please help, Ceph cluster lost and no recovery solution has worked. All OSDs are in but all PGs are unknown.

Can you provide more details, such as the OSD logs?
Because the noout/nodown flags were set and the mon storage was recovered, what the mon reports is not precisely what it has received since it was restarted (the OSD map it presents comes from the last saved state).
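
For reference, one way to see exactly what the mon currently believes (run from anywhere with a working `ceph` CLI, e.g. the Rook toolbox pod; these are standard Ceph commands, not specific to this cluster):
```
# Epoch and up/in counts of the OSD map the mon is serving right now
ceph osd stat

# The full dump shows the flags and the per-OSD state the mon recorded;
# noout keeps OSDs "in" even while they are down, nodown prevents them
# from being marked down in the first place
ceph osd dump | grep -E 'epoch|flags'
```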

And the mgr is active, so I suspect network connectivity is fine.
Meanwhile, since auth is disabled, the old OSD key should not be relevant (though I am not sure whether it did something unexpected).
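
If it helps, one way to double-check that the daemons really are running without auth (this assumes `ceph config get` is available, which it is on Pacific):
```
# Each of these should print "none" if auth was disabled successfully
ceph config get mon auth_cluster_required
ceph config get mon auth_service_required
ceph config get mon auth_client_required
```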

So there must be something wrong between the OSDs and the mon; the OSD and mon logs should help.
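
In a Rook cluster the daemon logs go to the pod stdout, so something like the following should be enough to start with (this assumes the default `rook-ceph` namespace and Rook's usual deployment names, e.g. `rook-ceph-mon-a` and `rook-ceph-osd-0`; adjust to your setup):
```
# Mon log (mon "a")
kubectl -n rook-ceph logs deploy/rook-ceph-mon-a --tail=500

# One of the OSDs that never comes up (osd.0 as an example)
kubectl -n rook-ceph logs deploy/rook-ceph-osd-0 --tail=500

# The operator log often shows why OSD pods fail to (re)register with the mon
kubectl -n rook-ceph logs deploy/rook-ceph-operator --tail=500
```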


On Fri, Jul 16, 2021 at 7:33 PM Ali Mihandoost <alimihandoost@xxxxxxxxx> wrote:
We've had a serious problem with our production cluster and we need urgent help. Thanks in advance.

**Is this a bug report or feature request?**
* Bug Report

**Deviation from expected behavior:**
After updating our Kubernetes cluster and rebooting our servers, our OSDs stopped working. We tried our best to restore them, using every method we could find and checking all possible issues, documents, and configurations.
We restored our Kubernetes snapshot, and we restored etcd.
Currently, the PG status shows as unknown.

**Expected behavior:**
OSDs should return to a normal state after an upgrade.

**How to reproduce it (minimal and precise):**
* Update Kubernetes `v1.17` to `v1.18` and rotate all certificates; Rook and Ceph then become unstable.
* Restore etcd to the last stable backup of `v1.17` using `rke etcd snapshot-restore ...` and run `rke up` again.

**Best solution we applied to restore the old Ceph cluster**
* Start a new, clean Rook Ceph cluster with the old CephCluster, CephBlockPool, CephFilesystem, CephNFS, and CephObjectStore resources.
* Shut the new cluster down once it has been created successfully.
* Replace the ceph-mon data with that of the old cluster.
* Replace the fsid in secrets/rook-ceph-mon with the old one.
* Fix the monmap in the ceph-mon db (see the sketch after this list).
* Fix the ceph mon auth key.
* Disable auth.
* Start the new cluster and watch it resurrect.
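
For the "fix monmap" step, the commands below show one way to do it with the stock Ceph tooling; treat this as a sketch only: it assumes a single surviving mon named `a`, Rook's usual mon data path `/var/lib/rook/mon-a/data`, and that the mon daemon is stopped while the map is edited:
```
# Extract the monmap from the restored mon store
ceph-mon -i a --extract-monmap /tmp/monmap --mon-data /var/lib/rook/mon-a/data

# Inspect it and remove mons that no longer exist (b and c here are examples)
monmaptool --print /tmp/monmap
monmaptool --rm b /tmp/monmap
monmaptool --rm c /tmp/monmap

# Inject the corrected map back before starting the mon again
ceph-mon -i a --inject-monmap /tmp/monmap --mon-data /var/lib/rook/mon-a/data
```
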
Reference: https://rook.github.io/docs/rook/v1.6/ceph-disaster-recovery.html

**Current state after all recovery solutions**
The Rook operator, manager, monitors, OSD pods, and all agents are ready, without fundamental errors.
All OSDs are `in` but `down`.
Pools are found, but all PGs are `unknown`!
`ceph -s`
```
  cluster:
    id:     .....
    health: HEALTH_WARN
            nodown,noout,norebalance flag(s) set
            Reduced data availability: 64 pgs inactive
            33 slow ops, oldest one blocked for 79746 sec, mon.a has slow ops

  services:
    mon: 1 daemons, quorum a (age 22h)
    mgr: a(active, since 22h)
    osd: 33 osds: 0 up, 33 in (since 22h)
         flags nodown,noout,norebalance

  data:
    pools:   2 pools, 64 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             64 unknown
```
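
For completeness, a few read-only commands that may help narrow down why every PG stays unknown (again assuming the `ceph` CLI is reachable, e.g. from the Rook toolbox pod):
```
# Detailed health output, including which PGs are inactive and why
ceph health detail

# CRUSH view of the OSDs and their last recorded up/down state
ceph osd tree

# PGs stuck inactive; "unknown" usually means no OSD has reported them
# to the mon since the store was restored
ceph pg dump_stuck inactive
```
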
**Environment**:
* OS: RancherOS v1.5.8
* Kernel: v4.14.138-rancher
* Cloud provider: Bare-metal and installed with RKE
* Kubernetes v1.17
* Ceph: v14.2.9 updated to v16.2.4 in recovery process
* Rook: v1.2.7 updated to v1.6.7 in recovery process

GitHub issue link: https://github.com/rook/rook/issues/8329

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
