Re: Recovering from no quorum (2/3 monitors down) via 1 good monitor

Syahrul Sazli Shaharir <sazli@xxxxxxxxxx> · Wed, 11 Jul 2018 01:21:49 +0800

Hi Paul,

Yes, that's what I did, but it caused some errors. In the end I had to
delete the /var/lib/ceph/mon/* directory on the bad node and run the inject
with the --mkfs argument to recreate the database. All good now, thanks. :)
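
For anyone finding this thread later: rebuilding a broken mon store generally
looks something like the sketch below. It follows the manual "add a monitor"
procedure from the Ceph docs and is not a transcript of the exact commands
run on this cluster. Assumptions: the systemd unit is named ceph-mon@<id>,
the surviving mon already has quorum on its own (so "ceph mon getmap" works),
the keyring path is a made-up temp file, and the old store is moved aside
instead of deleted. DRY_RUN=1 (the default) only prints the commands.

```shell
#!/bin/sh
# Sketch: recreate the store of one broken monitor from a healthy cluster.
MON_ID="${MON_ID:-mail1}"                    # monitor to rebuild
MON_DIR="/var/lib/ceph/mon/ceph-${MON_ID}"   # default mon data path
MONMAP="${MONMAP:-/tmp/monmap}"
KEYRING="${KEYRING:-/tmp/mon.keyring}"       # hypothetical temp location
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rebuild_mon() {
    run systemctl stop "ceph-mon@${MON_ID}"
    # Keep the corrupt store around instead of deleting it outright.
    run mv "$MON_DIR" "${MON_DIR}.corrupt"
    run mkdir -p "$MON_DIR"
    # Fetch the current monmap and the mon. key from the running cluster.
    run ceph mon getmap -o "$MONMAP"
    run ceph auth get mon. -o "$KEYRING"
    # Create a fresh, empty store for this monitor.
    run ceph-mon -i "$MON_ID" --mkfs --monmap "$MONMAP" --keyring "$KEYRING"
    # The daemon runs as user ceph, so hand the new store over to it.
    run chown -R ceph:ceph "$MON_DIR"
    run systemctl start "ceph-mon@${MON_ID}"
}

rebuild_mon
```

Run it once with the default DRY_RUN=1 to review the commands, then with
DRY_RUN=0 MON_ID=mail2 for the second node. The new mon also has to know
which address to bind to (ceph.conf, or --public-addr on the command line).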

On Tue, Jul 10, 2018 at 10:46 PM, Paul Emmerich <paul.emmerich@xxxxxxxx> wrote:
> easy:
>
> 1. make sure that none of the mons are running
> 2. extract the monmap from the good one
> 3. use monmaptool to remove the two other mons from it
> 4. inject the mon map back into the good mon
> 5. start the good mon
> 6. you now have a running cluster with only one mon; add two new ones
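
Steps 1 to 5 above could be sketched roughly as follows. The thread never
names the surviving monitor, so GOOD_MON is a placeholder; the ceph-mon@<id>
systemd unit name is also an assumption about the setup. With DRY_RUN=1 (the
default) the script only prints what it would run.

```shell
#!/bin/sh
# Sketch: shrink the monmap so the one healthy monitor can form quorum alone.
GOOD_MON="${GOOD_MON:-goodmon}"       # hypothetical ID of the healthy monitor
BAD_MONS="${BAD_MONS:-mail1 mail2}"   # the two monitors that will not start
MONMAP="${MONMAP:-/tmp/monmap}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

shrink_to_one_mon() {
    # 1. Stop the mon on this host (do the same on every other mon host).
    run systemctl stop "ceph-mon@${GOOD_MON}"
    # 2. Extract the monmap from the healthy monitor's store.
    run ceph-mon -i "$GOOD_MON" --extract-monmap "$MONMAP"
    # 3. Remove the broken monitors from the extracted map.
    for m in $BAD_MONS; do
        run monmaptool "$MONMAP" --rm "$m"
    done
    # Print the map to check that only the healthy monitor is left.
    run monmaptool "$MONMAP" --print
    # 4. Inject the edited map back into the healthy monitor.
    run ceph-mon -i "$GOOD_MON" --inject-monmap "$MONMAP"
    # 5. Start the healthy monitor again.
    run systemctl start "ceph-mon@${GOOD_MON}"
}

shrink_to_one_mon
```

Check the dry-run output first, then rerun with DRY_RUN=0 and GOOD_MON set
to the real monitor ID.
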
>
>
>   Paul
>
>
> 2018-07-10 5:50 GMT+02:00 Syahrul Sazli Shaharir <sazli@xxxxxxxxxx>:
>>
>> Hi,
>>
>> I am running Proxmox VE 5.1 with Ceph Luminous 12.2.4 as storage. I
>> had been running 3 monitors until an abrupt power outage left 2
>> monitors down and unable to start, and 1 monitor up but without
>> quorum.
>>
>> I tried extracting monmap from the good monitor and injecting it into
>> the other two, but got different errors for each:-
>>
>> 1. mon.mail1
>>
>> # ceph-mon -i mail1 --inject-monmap /tmp/monmap
>> 2018-07-10 11:29:03.562840 7f7d82845f80 -1 abort: Corruption: Bad table magic number*** Caught signal (Aborted) **
>>  in thread 7f7d82845f80 thread_name:ceph-mon
>>
>>  ceph version 12.2.4 (4832b6f0acade977670a37c20ff5dbe69e727416) luminous (stable)
>>  1: (()+0x9439e4) [0x5652655669e4]
>>  2: (()+0x110c0) [0x7f7d81bfe0c0]
>>  3: (gsignal()+0xcf) [0x7f7d7ee12fff]
>>  4: (abort()+0x16a) [0x7f7d7ee1442a]
>>  5: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::list*)+0x2f9) [0x5652650a2eb9]
>>  6: (main()+0x1377) [0x565264ec3c57]
>>  7: (__libc_start_main()+0xf1) [0x7f7d7ee002e1]
>>  8: (_start()+0x2a) [0x565264f5954a]
>> 2018-07-10 11:29:03.563721 7f7d82845f80 -1 *** Caught signal (Aborted) **
>>  in thread 7f7d82845f80 thread_name:ceph-mon
>>
>> 2. mon.mail2
>>
>> # ceph-mon -i mail2 --inject-monmap /tmp/monmap
>> 2018-07-10 11:18:07.536097 7f161e2e3f80 -1 rocksdb: Corruption: Can't access /065339.sst: IO error: /var/lib/ceph/mon/ceph-mail2/store.db/065339.sst: No such file or directory
>> Can't access /065337.sst: IO error: /var/lib/ceph/mon/ceph-mail2/store.db/065337.sst: No such file or directory
>>
>> 2018-07-10 11:18:07.536106 7f161e2e3f80 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-mail2': (22) Invalid argument
>>
>> Any other way I can recover other than rebuilding the monitor store
>> from the OSDs?
>>
>> Thanks.
>>
>> --
>> --sazli
>> Syahrul Sazli Shaharir <sazli@xxxxxxxxxx>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90

-- 
--sazli
Syahrul Sazli Shaharir <sazli@xxxxxxxxxx>
Mobile: +6019 385 8301 - YM/Skype: syahrulsazli
System Administrator
TMK Pulasan (002339810-M) http://pulasan.my/
11 Jalan 3/4, 43650 Bandar Baru Bangi, Selangor, Malaysia.
Tel/Fax: +603 8926 0338