Yes, I tried restarting them and even rebooting the mds machine. No joy.
If I try to start ceph-mds by hand, it returns:
2022-09-15 15:21:39.848 7fc43dbd2700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
failed to fetch mon config (--no-mon-config to skip)
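(A thought while staring at this: "handle_auth_bad_method" looks like a cephx key problem rather than a daemon problem. Since the mon store was rebuilt from the OSDs, maybe the mds key in the rebuilt auth database no longer matches the key in the local keyring. Comparing the two might tell; "mds.<id>" here is a placeholder for the real daemon name:

# ceph auth get mds.<id>
# cat /var/lib/ceph/mds/ceph-<id>/keyring

If they differ, re-creating the key with "ceph auth get-or-create" and copying it into the local keyring might let the mds authenticate again.)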
I found this information online; maybe it's something to try next:
https://docs.ceph.com/en/quincy/cephfs/recover-fs-after-mon-store-loss/
But I think maybe the mds needs to be running before that?
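(If it comes to that, the procedure on that page seems to boil down to something like this, with <fs_name> and the pool names as placeholders for ours:

# ceph fs new <fs_name> <metadata_pool> <data_pool> --force --recover
# ceph fs set <fs_name> joinable true

If I'm reading the doc right, the --recover flag keeps any MDS from activating until the fs is marked joinable, so the mds wouldn't need to be running first; it just needs to be able to start and authenticate.)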
On 9/15/22 15:19, Wesley Dillingham wrote:
Having the quorum / monitors back up may change the MDS and RGW's
ability to start and stay running. Have you tried just restarting the
MDS / RGW daemons again?
Respectfully,
*Wes Dillingham*
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
On Thu, Sep 15, 2022 at 5:54 PM Jorge Garcia <jgarcia@xxxxxxxxxxxx> wrote:
OK, I'll try to give more details as I remember them.

1. There was a power outage, and then power came back up.

2. When the systems came back up, I did a "ceph -s" and it never returned. Further investigation revealed that the ceph-mon processes had not started on any of the 3 monitors. I looked at the log files, and they said something about:

ceph_abort_msg("Bad table magic number: expected 9863518390377041911, found 30790637387776 in /var/lib/ceph/mon/ceph-gi-cprv-adm-01/store.db/2886524.sst")
Looking on the internet, I found some suggestions about troubleshooting monitors in:

https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/

I quickly determined that the monitors weren't running, so I found the section called "RECOVERY USING OSDS". The description made sense:
"But what if all monitors fail at the same time? Since users are
encouraged to deploy at least three (and preferably five) monitors
in a
Ceph cluster, the chance of simultaneous failure is rare. But
unplanned
power-downs in a data center with improperly configured disk/fs
settings
could fail the underlying file system, and hence kill all the
monitors.
In this case, we can recover the monitor store with the information
stored in OSDs."
So, I did the procedure described in that section, and then made sure
the correct keys were in the keyring and restarted the processes.
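(For anyone hitting this later, the core of that procedure, condensed from the doc, looks roughly like the following; paths and the keyring location are placeholders, and it has to be run with the OSDs stopped:

# on each OSD host, pull cluster map info out of every OSD
ms=/tmp/mon-store
mkdir -p $ms
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path $osd --no-mon-config \
        --op update-mon-db --mon-store-path $ms
done

# make sure the keyring has mon. and client.admin keys with full caps,
# then rebuild the monitor store from the collected maps
ceph-authtool /path/to/admin.keyring -n mon. --cap mon 'allow *'
ceph-authtool /path/to/admin.keyring -n client.admin \
    --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' --cap mgr 'allow *'
ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring

The rebuilt store then replaces the broken store.db under /var/lib/ceph/mon/<mon_id>/ on each monitor. One gotcha the doc calls out: the rebuilt auth database only contains the keys you put in that keyring, which could explain daemons like the mds and rgw failing to authenticate afterwards.)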
WELL, I WAS REDOING ALL THESE STEPS WHILE WRITING THIS MAIL MESSAGE, AND NOW THE MONITORS ARE BACK! I must have missed some step in the middle of my panic.
# ceph -s
  cluster:
    id:     aaaaaaaa-bbbb-cccc-dddd-ffffffffffff
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim

  services:
    mon: 3 daemons, quorum host-a, host-b, host-c (age 19m)
    mgr: host-b(active, since 19m), standbys: host-a, host-c
    osd: 164 osds: 164 up (since 16m), 164 in (since 8h)

  data:
    pools:   14 pools, 2992 pgs
    objects: 91.58M objects, 290 TiB
    usage:   437 TiB used, 1.2 PiB / 1.7 PiB avail
    pgs:     2985 active+clean
             7    active+clean+scrubbing+deep
A couple of missing or strange things remain:

1. Missing mds
2. Missing rgw
3. New warning showing up (the global_id one; see the note below)

But overall, better than a couple of hours ago. If anybody is still reading and has any suggestions about how to solve the 3 items above, that would be great! Otherwise, back to scanning the internet for ideas...
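On item 3: if I understand the docs right, that warning is the mitigation for the global_id reclaim CVE (CVE-2021-20288), and once every client and daemon is patched it can be cleared with:

# ceph config set mon auth_allow_insecure_global_id_reclaim false

Though it's probably safer to leave it until the mds/rgw situation is sorted, since turning reclaim off can lock out unpatched clients.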
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx