Having the quorum / monitors back up may change the MDS and RGW's ability
to start and stay running. Have you tried just restarting the MDS / RGW
daemons again?

Respectfully,

*Wes Dillingham*
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>


On Thu, Sep 15, 2022 at 5:54 PM Jorge Garcia <jgarcia@xxxxxxxxxxxx> wrote:

> OK, I'll try to give more details as I remember them.
>
> 1. There was a power outage and then power came back up.
>
> 2. When the systems came back up, I ran "ceph -s" and it never returned.
> Further investigation revealed that the ceph-mon processes had not
> started on any of the 3 monitors. The log files all showed an abort like:
>
> ceph_abort_msg("Bad table magic number: expected 9863518390377041911,
> found 30790637387776 in
> /var/lib/ceph/mon/ceph-gi-cprv-adm-01/store.db/2886524.sst")
>
> Searching the internet, I found some suggestions about troubleshooting
> monitors in:
>
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/
>
> I had quickly determined that the monitors weren't running, so I went to
> the section titled "RECOVERY USING OSDS". The description made sense:
>
> "But what if all monitors fail at the same time? Since users are
> encouraged to deploy at least three (and preferably five) monitors in a
> Ceph cluster, the chance of simultaneous failure is rare. But unplanned
> power-downs in a data center with improperly configured disk/fs settings
> could fail the underlying file system, and hence kill all the monitors.
> In this case, we can recover the monitor store with the information
> stored in OSDs."
>
> So I followed the procedure described in that section, made sure the
> correct keys were in the keyring, and restarted the processes.
>
> WELL, I WAS REDOING ALL THESE STEPS WHILE WRITING THIS MAIL MESSAGE, AND
> NOW THE MONITORS ARE BACK! I must have missed some step in the middle of
> my panic.
>
> # ceph -s
>
>   cluster:
>     id:     aaaaaaaa-bbbb-cccc-dddd-ffffffffffff
>     health: HEALTH_WARN
>             mons are allowing insecure global_id reclaim
>
>   services:
>     mon: 3 daemons, quorum host-a, host-b, host-c (age 19m)
>     mgr: host-b(active, since 19m), standbys: host-a, host-c
>     osd: 164 osds: 164 up (since 16m), 164 in (since 8h)
>
>   data:
>     pools:   14 pools, 2992 pgs
>     objects: 91.58M objects, 290 TiB
>     usage:   437 TiB used, 1.2 PiB / 1.7 PiB avail
>     pgs:     2985 active+clean
>              7    active+clean+scrubbing+deep
>
> A couple of missing or strange things:
>
> 1. Missing mds
> 2. Missing rgw
> 3. A new warning showing up
>
> But overall, better than a couple of hours ago. If anybody is still
> reading and has any suggestions about how to solve the 3 items above,
> that would be great! Otherwise, back to scanning the internet for
> ideas...

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
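
For reference, on a package-based (non-cephadm) deployment, Wes's
suggestion amounts to restarting the systemd units on the MDS and RGW
hosts. The instance names after the "@" below are placeholders; the
actual names depend on how the daemons were set up:

  # on the MDS host (the instance name is usually the hostname)
  systemctl restart ceph-mds@host-a
  systemctl status ceph-mds@host-a

  # on the RGW host (package installs typically name the instance
  # "rgw.<hostname>", matching the [client.rgw.*] section in ceph.conf)
  systemctl restart ceph-radosgw@rgw.host-a
  journalctl -u ceph-radosgw@rgw.host-a -n 50

If the daemons start and then die again, the journal output should show
why now that the monitors are reachable.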
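The "RECOVERY USING OSDS" procedure Jorge followed is scripted in full on
the troubleshooting-mon page linked above. A condensed single-host sketch
of it looks like this (the mon name "mon.foo" and the keyring path are
placeholders, and the OSDs must be stopped while their stores are read):

  ms=/root/mon-store
  mkdir $ms

  # collect cluster map info from every (stopped) OSD on this host
  for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path $osd --no-mon-config \
        --op update-mon-db --mon-store-path $ms
  done

  # rebuild the monitor store, restoring cephx caps from the admin keyring
  ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring

  # back up the corrupted store and move the rebuilt one into place
  # (repeat on every monitor)
  mv /var/lib/ceph/mon/mon.foo/store.db /var/lib/ceph/mon/mon.foo/store.db.corrupted
  mv $ms/store.db /var/lib/ceph/mon/mon.foo/store.db
  chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db

The full version in the docs also loops over multiple OSD hosts with
rsync/ssh to accumulate the map data, which is exactly the kind of step
that is easy to miss mid-panic.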
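As for item 3, "mons are allowing insecure global_id reclaim" is the
AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED health warning introduced with
the CVE-2021-20288 fix. Once every client and daemon is running a patched
release it can be cleared by disallowing the insecure behavior; until
then it can be muted:

  # see which clients (if any) are still reclaiming insecurely
  ceph health detail

  # once all clients and daemons are upgraded, disallow insecure reclaim
  ceph config set mon auth_allow_insecure_global_id_reclaim false

  # or mute the warning temporarily, e.g. for one week
  ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 1w

Note that disallowing reclaim while unpatched clients are still connected
will cut those clients off at their next reconnect, so check "ceph health
detail" first.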