Re: Power outage recovery

OK, I'll try to give more details as I remember them.

1. There was a power outage and then power came back up.

2. When the systems came back up, I ran "ceph -s" and it never returned. Further investigation revealed that the ceph-mon processes had not started on any of the 3 monitors. The log files showed something like:

ceph_abort_msg("Bad table magic number: expected 9863518390377041911, found 30790637387776 in /var/lib/ceph/mon/ceph-gi-cprv-adm-01/store.db/2886524.sst")
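(For anyone hitting the same thing after an outage, the basic checks look roughly like this; this assumes a systemd-managed, non-containerized install, with the mon ID taken from the path in the error above.)

# systemctl status ceph-mon@gi-cprv-adm-01
# journalctl -u ceph-mon@gi-cprv-adm-01 --since "1 hour ago"
# less /var/log/ceph/ceph-mon.gi-cprv-adm-01.log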

Searching the internet, I found some suggestions about troubleshooting monitors at:

https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/

I quickly determined that the monitors weren't running, so I found the section titled "RECOVERY USING OSDS". The description made sense:

"But what if all monitors fail at the same time? Since users are encouraged to deploy at least three (and preferably five) monitors in a Ceph cluster, the chance of simultaneous failure is rare. But unplanned power-downs in a data center with improperly configured disk/fs settings could fail the underlying file system, and hence kill all the monitors. In this case, we can recover the monitor store with the information stored in OSDs."

So, I did the procedure described in that section, and then made sure the correct keys were in the keyring and restarted the processes.
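In outline, what I did looks roughly like the sketch below. This is condensed from memory and from that docs page, not a drop-in script: it assumes a systemd-managed, non-containerized cluster, it shows only a single OSD host (the docs loop over all OSD hosts and carry the store between them with rsync), and the keyring path and the mon ID gi-cprv-adm-01 are just my setup.

ms=/root/mon-store
mkdir $ms

# with all OSDs stopped, feed every OSD's copy of the cluster maps into a new store
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path $osd --no-mon-config \
        --op update-mon-db --mon-store-path $ms
done

# make sure mon. and client.admin have the needed caps, then rebuild the mon store
ceph-authtool /etc/ceph/ceph.client.admin.keyring -n mon. --cap mon 'allow *'
ceph-authtool /etc/ceph/ceph.client.admin.keyring -n client.admin \
    --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
ceph-monstore-tool $ms rebuild -- --keyring /etc/ceph/ceph.client.admin.keyring

# on each monitor: back up the corrupted store.db, move the rebuilt one into place
mv /var/lib/ceph/mon/ceph-gi-cprv-adm-01/store.db \
   /var/lib/ceph/mon/ceph-gi-cprv-adm-01/store.db.corrupted
cp -r $ms/store.db /var/lib/ceph/mon/ceph-gi-cprv-adm-01/
chown -R ceph:ceph /var/lib/ceph/mon/ceph-gi-cprv-adm-01/store.db

# then restart the mon daemons
systemctl restart ceph-mon@gi-cprv-adm-01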

WELL, I WAS REDOING ALL THESE STEPS WHILE WRITING THIS MAIL MESSAGE, AND NOW THE MONITORS ARE BACK! I must have missed some step in the middle of my panic.

# ceph -s

  cluster:
    id:     aaaaaaaa-bbbb-cccc-dddd-ffffffffffff
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim

  services:
    mon: 3 daemons, quorum host-a, host-b, host-c (age 19m)
    mgr: host-b(active, since 19m), standbys: host-a, host-c
    osd: 164 osds: 164 up (since 16m), 164 in (since 8h)

  data:
    pools:   14 pools, 2992 pgs
    objects: 91.58M objects, 290 TiB
    usage:   437 TiB used, 1.2 PiB / 1.7 PiB avail
    pgs:     2985 active+clean
             7    active+clean+scrubbing+deep

A couple of things are still missing or strange:

1. Missing mds
2. Missing rgw
3. New warning about insecure global_id reclaim showing up

But overall, things look better than they did a couple of hours ago. If anybody is still reading and has any suggestions about how to solve the 3 items above, that would be great! Otherwise, back to scanning the internet for ideas...
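(For what it's worth, the next things I plan to try look roughly like this; the unit names assume a systemd/package install rather than cephadm containers, and the last command should only be run once all clients are new enough to use secure global_id reclaim.)

# check whether the mds/rgw daemons simply weren't started after the outage
systemctl list-units 'ceph-mds*' 'ceph-radosgw*'
systemctl start ceph-mds.target ceph-radosgw.target

# get the exact health code behind the new warning
ceph health detail

# either mute it for now, or disable insecure reclaim once clients are updated
ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 1w
ceph config set mon auth_allow_insecure_global_id_reclaim false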




