Not sure if this has been resolved yet, but from a quick glance this looks like a Rook issue with just getting the OSD daemons started up. Have you tried the Rook Slack?

On Fri, Jul 16, 2021 at 6:34 AM Ali Mihandoost <alimihandoost@xxxxxxxxx> wrote:
>
> We've had a serious problem with our production cluster and we need urgent help. Thanks in advance.
>
> **Is this a bug report or feature request?**
> * Bug Report
>
> **Deviation from expected behavior:**
> After updating our Kubernetes and rebooting our servers, our OSDs have stopped working. We tried our best to restore them, using all the different methods and checking all possibilities, issues, documents, and configurations.
> We restored our Kubernetes snapshot, and we restored etcd.
> Currently, the PG status shows as unknown.
>
> **Expected behavior:**
> OSDs should return to a normal state after an upgrade.
>
> **How to reproduce it (minimal and precise):**
> * Update Kubernetes `v1.17` to `v1.18` and change all certificates; then Rook and Ceph become unstable!
> * Restore etcd to the last stable backup of `v1.17` using `rke etcd snapshot-restore ...` and run `rke up` again.
>
> **Best solution we applied to restore the old Ceph cluster:**
> * Start a new and clean Rook Ceph cluster, with the old CephCluster, CephBlockPool, CephFilesystem, CephNFS, and CephObjectStore resources.
> * Shut the new cluster down once it has been created successfully.
> * Replace the ceph-mon data with that of the old cluster.
> * Replace the fsid in secrets/rook-ceph-mon with that of the old one.
> * Fix the monmap in the ceph-mon db.
> * Fix the ceph mon auth key.
> * Disable auth.
> * Start the new cluster, watch it resurrect.
> Reference: https://rook.github.io/docs/rook/v1.6/ceph-disaster-recovery.html
>
> **Current state after all recovery solutions:**
> Rook operator, manager, monitors, OSD pods, and all agents are ready without fundamental errors.
> All OSDs are `in` but `down`.
> Pools are found, but all PGs are `unknown`!
> `ceph -s`
> ```
>   cluster:
>     id:     .....
>     health: HEALTH_WARN
>             nodown,noout,norebalance flag(s) set
>             Reduced data availability: 64 pgs inactive
>             33 slow ops, oldest one blocked for 79746 sec, mon.a has slow ops
>
>   services:
>     mon: 1 daemons, quorum a (age 22h)
>     mgr: a(active, since 22h)
>     osd: 33 osds: 0 up, 33 in (since 22h)
>          flags nodown,noout,norebalance
>
>   data:
>     pools:   2 pools, 64 pgs
>     objects: 0 objects, 0 B
>     usage:   0 B used, 0 B / 0 B avail
>     pgs:     100.000% pgs unknown
>              64 unknown
> ```
>
> **Environment**:
> * OS: RancherOS v1.5.8
> * Kernel: v4.14.138-rancher
> * Cloud provider: Bare-metal, installed with RKE
> * Kubernetes: v1.17
> * Ceph: v14.2.9, updated to v16.2.4 in the recovery process
> * Rook: v1.2.7, updated to v1.6.7 in the recovery process
>
> GitHub issue link: https://github.com/rook/rook/issues/8329
>
> _______________________________________________
> Dev mailing list -- dev@xxxxxxx
> To unsubscribe send an email to dev-leave@xxxxxxx
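For anyone landing here from the archives, the "fix the monmap" step in the quoted recovery procedure is the one that most often trips people up. A rough sketch of what it involves, based on the linked Rook disaster-recovery guide, is below. This is an illustrative sketch only: the mon ID `a`, the data path, and the stale mon IDs `b`/`c` are assumptions from Rook defaults, so adjust them for your cluster before running anything.

```shell
# Sketch of the "fix the monmap in the ceph-mon db" step, run inside the
# surviving mon pod (e.g. via kubectl -n rook-ceph exec). The mon ID "a",
# the --mon-data path, and the stale IDs "b"/"c" are assumptions.

# Extract the current monmap from the mon's data store.
ceph-mon --extract-monmap /tmp/monmap \
    --mon-data /var/lib/ceph/mon/ceph-a

# Inspect it; it should list only the mons that actually exist now.
monmaptool --print /tmp/monmap

# Remove stale mon entries left over from the old cluster (if any).
monmaptool /tmp/monmap --rm b --rm c

# Inject the corrected monmap back into the mon's data store,
# then restart the mon.
ceph-mon --inject-monmap /tmp/monmap \
    --mon-data /var/lib/ceph/mon/ceph-a
```

With a single healthy mon in quorum, `ceph -s` should then show the mon up, and the OSDs can be allowed to mark themselves up once the `nodown`/`noout` flags are cleared.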