Having the quorum / monitors back up may change the MDS and RGW's ability
to start and stay running. Have you tried just restarting the MDS / RGW
daemons again?

Respectfully,

*Wes Dillingham*
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>


On Thu, Sep 15, 2022 at 5:54 PM Jorge Garcia <jgarcia@xxxxxxxxxxxx> wrote:

> OK, I'll try to give more details as I remember them.
>
> 1. There was a power outage and then power came back up.
>
> 2. When the systems came back up, I ran "ceph -s" and it never returned.
> Further investigation revealed that the ceph-mon processes had not
> started on any of the 3 monitors. The log files all showed an abort like:
>
> ceph_abort_msg("Bad table magic number: expected 9863518390377041911,
> found 30790637387776 in
> /var/lib/ceph/mon/ceph-gi-cprv-adm-01/store.db/2886524.sst")
>
> Searching the internet, I found some suggestions about troubleshooting
> monitors in:
>
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/
>
> I had quickly determined that the monitors weren't running, so I went to
> the section titled "RECOVERY USING OSDS". The description made sense:
>
> "But what if all monitors fail at the same time? Since users are
> encouraged to deploy at least three (and preferably five) monitors in a
> Ceph cluster, the chance of simultaneous failure is rare. But unplanned
> power-downs in a data center with improperly configured disk/fs settings
> could fail the underlying file system, and hence kill all the monitors.
> In this case, we can recover the monitor store with the information
> stored in OSDs."
>
> So I followed the procedure described in that section, made sure the
> correct keys were in the keyring, and restarted the processes.
>
> WELL, I WAS REDOING ALL THESE STEPS WHILE WRITING THIS MAIL MESSAGE, AND
> NOW THE MONITORS ARE BACK! I must have missed some step in the middle of
> my panic.
>
> # ceph -s
>
>   cluster:
>     id:     aaaaaaaa-bbbb-cccc-dddd-ffffffffffff
>     health: HEALTH_WARN
>             mons are allowing insecure global_id reclaim
>
>   services:
>     mon: 3 daemons, quorum host-a, host-b, host-c (age 19m)
>     mgr: host-b(active, since 19m), standbys: host-a, host-c
>     osd: 164 osds: 164 up (since 16m), 164 in (since 8h)
>
>   data:
>     pools:   14 pools, 2992 pgs
>     objects: 91.58M objects, 290 TiB
>     usage:   437 TiB used, 1.2 PiB / 1.7 PiB avail
>     pgs:     2985 active+clean
>              7    active+clean+scrubbing+deep
>
> A couple of missing or strange things:
>
> 1. Missing mds
> 2. Missing rgw
> 3. A new warning showing up
>
> But overall, better than a couple of hours ago. If anybody is still
> reading and has any suggestions about how to solve the 3 items above,
> that would be great! Otherwise, back to scanning the internet for
> ideas...

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
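
For reference, on a package-based (non-cephadm) deployment, Wes's
suggestion amounts to restarting the systemd units on the MDS and RGW
hosts. The instance names after the "@" below are placeholders; the
actual names depend on how the daemons were set up:

  # on the MDS host (the instance name is usually the hostname)
  systemctl restart ceph-mds@host-a
  systemctl status ceph-mds@host-a

  # on the RGW host (package installs typically name the instance
  # "rgw.<hostname>", matching the [client.rgw.*] section in ceph.conf)
  systemctl restart ceph-radosgw@rgw.host-a
  journalctl -u ceph-radosgw@rgw.host-a -n 50

If the daemons start and then die again, the journal output should show
why now that the monitors are reachable.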
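The "RECOVERY USING OSDS" procedure Jorge followed is scripted in full on
the troubleshooting-mon page linked above. A condensed single-host sketch
of it looks like this (the mon name "mon.foo" and the keyring path are
placeholders, and the OSDs must be stopped while their stores are read):

  ms=/root/mon-store
  mkdir $ms

  # collect cluster map info from every (stopped) OSD on this host
  for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path $osd --no-mon-config \
        --op update-mon-db --mon-store-path $ms
  done

  # rebuild the monitor store, restoring cephx caps from the admin keyring
  ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring

  # back up the corrupted store and move the rebuilt one into place
  # (repeat on every monitor)
  mv /var/lib/ceph/mon/mon.foo/store.db /var/lib/ceph/mon/mon.foo/store.db.corrupted
  mv $ms/store.db /var/lib/ceph/mon/mon.foo/store.db
  chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db

The full version in the docs also loops over multiple OSD hosts with
rsync/ssh to accumulate the map data, which is exactly the kind of step
that is easy to miss mid-panic.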
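As for item 3, "mons are allowing insecure global_id reclaim" is the
AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED health warning introduced with
the CVE-2021-20288 fix. Once every client and daemon is running a patched
release it can be cleared by disallowing the insecure behavior; until
then it can be muted:

  # see which clients (if any) are still reclaiming insecurely
  ceph health detail

  # once all clients and daemons are upgraded, disallow insecure reclaim
  ceph config set mon auth_allow_insecure_global_id_reclaim false

  # or mute the warning temporarily, e.g. for one week
  ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 1w

Note that disallowing reclaim while unpatched clients are still connected
will cut those clients off at their next reconnect, so check "ceph health
detail" first.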