Re: Power outage recovery

Recovering the monitor store from the OSDs loses the MDS and RGW keys those
daemons use to authenticate with cephx. You need to recreate them with the
ceph auth commands. I don’t have the exact commands handy, but they are
discussed in the mailing list archives.
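From memory they look roughly like the following, but double-check the caps
against the docs or the archive threads (the names are placeholders):

  ceph auth get-or-create mds.<name> mon 'profile mds' mgr 'profile mds' mds 'allow *' osd 'allow *'
  ceph auth get-or-create client.rgw.<name> mon 'allow rw' osd 'allow rwx'
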
-Greg

On Thu, Sep 15, 2022 at 3:28 PM Jorge Garcia <jgarcia@xxxxxxxxxxxx> wrote:

> Yes, I tried restarting them and even rebooting the mds machine. No joy.
> If I try to start ceph-mds by hand, it returns:
>
> 2022-09-15 15:21:39.848 7fc43dbd2700 -1 monclient(hunting):
> handle_auth_bad_method server allowed_methods [2] but i only support [2]
> failed to fetch mon config (--no-mon-config to skip)
>
> I found this information online, maybe something to try next:
>
> https://docs.ceph.com/en/quincy/cephfs/recover-fs-after-mon-store-loss/
>
> But I think maybe the mds needs to be running before that?
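>
> If it helps, my reading of that page (paraphrasing, so please double-check
> against the doc itself) is roughly:
>
>   ceph fs new <fs_name> <metadata_pool> <data_pool> --force --recover
>   ceph fs set <fs_name> joinable true
>
> where --recover keeps rank 0 marked failed so an MDS doesn't overwrite the
> existing metadata, and joinable=true then lets a standby MDS pick it up.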
>
> On 9/15/22 15:19, Wesley Dillingham wrote:
> > Having the quorum / monitors back up may change the MDS and RGW's
> > ability to start and stay running. Have you tried just restarting the
> > MDS / RGW daemons again?
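> >
> > (Depending on how they were deployed, that's probably something along the
> > lines of "systemctl restart ceph-mds@<id>" and "systemctl restart
> > ceph-radosgw@rgw.<id>" on the hosts in question.)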
> >
> > Respectfully,
> >
> > *Wes Dillingham*
> > wes@xxxxxxxxxxxxxxxxx
> > LinkedIn <http://www.linkedin.com/in/wesleydillingham>
> >
> >
> > On Thu, Sep 15, 2022 at 5:54 PM Jorge Garcia <jgarcia@xxxxxxxxxxxx>
> wrote:
> >
> >     OK, I'll try to give more details as I remember them.
> >
> >     1. There was a power outage and then power came back up.
> >
> >     2. When the systems came back up, I did a "ceph -s" and it never
> >     returned. Further investigation revealed that the ceph-mon
> >     processes had
> >     not started in any of the 3 monitors. I looked at the log files
> >     and it
> >     said something about:
> >
> >     ceph_abort_msg("Bad table magic number: expected 9863518390377041911,
> >     found 30790637387776 in
> >     /var/lib/ceph/mon/ceph-gi-cprv-adm-01/store.db/2886524.sst")
> >
> >     Looking at the internet, I found some suggestions about
> >     troubleshooting
> >     monitors in:
> >
> >
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/
> >
> >     I quickly determined that the monitors weren't running, so I found
> >     the
> >     section where it said "RECOVERY USING OSDS". The description made
> >     sense:
> >
> >     "But what if all monitors fail at the same time? Since users are
> >     encouraged to deploy at least three (and preferably five) monitors
> >     in a
> >     Ceph cluster, the chance of simultaneous failure is rare. But
> >     unplanned
> >     power-downs in a data center with improperly configured disk/fs
> >     settings
> >     could fail the underlying file system, and hence kill all the
> >     monitors.
> >     In this case, we can recover the monitor store with the information
> >     stored in OSDs."
> >
> >     So, I did the procedure described in that section, and then made sure
> >     the correct keys were in the keyring and restarted the processes.
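> >
> >     (For anyone finding this later: the core of that procedure, roughly,
> >     is to run "ceph-objectstore-tool --data-path <osd-path> --op
> >     update-mon-db --mon-store-path <dir>" against every OSD to collect
> >     the cluster maps, build a keyring with mon. and client.admin caps
> >     using ceph-authtool, and then rebuild the store with
> >     "ceph-monstore-tool <dir> rebuild -- --keyring <keyring>". The exact
> >     script is on the page above.)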
> >
> >     WELL, I WAS REDOING ALL THESE STEPS WHILE WRITING THIS MAIL
> >     MESSAGE, AND
> >     NOW THE MONITORS ARE BACK! I must have missed some step in the
> >     middle of
> >     my panic.
> >
> >     # ceph -s
> >
> >        cluster:
> >          id:     aaaaaaaa-bbbb-cccc-dddd-ffffffffffff
> >          health: HEALTH_WARN
> >                  mons are allowing insecure global_id reclaim
> >
> >        services:
> >          mon: 3 daemons, quorum host-a, host-b, host-c (age 19m)
> >          mgr: host-b(active, since 19m), standbys: host-a, host-c
> >          osd: 164 osds: 164 up (since 16m), 164 in (since 8h)
> >
> >        data:
> >          pools:   14 pools, 2992 pgs
> >          objects: 91.58M objects, 290 TiB
> >          usage:   437 TiB used, 1.2 PiB / 1.7 PiB avail
> >          pgs:     2985 active+clean
> >                   7    active+clean+scrubbing+deep
> >
> >     Couple of missing or strange things:
> >
> >     1. Missing mds
> >     2. Missing rgw
> >     3. New warning showing up
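> >
> >     (For #3, I suspect it's the usual auth_allow_insecure_global_id_reclaim
> >     warning; my understanding is it goes away after running "ceph config
> >     set mon auth_allow_insecure_global_id_reclaim false" once all clients
> >     are updated, but corrections welcome.)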
> >
> >     But overall, better than a couple hours ago. If anybody is still
> >     reading
> >     and has any suggestions about how to solve the 3 items above, that
> >     would
> >     be great! Otherwise, back to scanning the internet for ideas...
> >
> >     _______________________________________________
> >     ceph-users mailing list -- ceph-users@xxxxxxx
> >     To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



