Re: Fwd: lost power. monitors died. Cephx errors now

A coworker patched leveldb and we were able to export quite a bit of data from kh08's leveldb database. At this point I think I need to reconstruct a new leveldb with whatever values I can. Is it the same leveldb database across all 3 monitors? I.e., will keys exported from one work in the others? They should all have the same keys/values, although constructed differently, right? I can't blindly copy /var/lib/ceph/mon/ceph-$(hostname)/store.db/ from one host to another, right? But can I copy the keys/values from one to another?
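
If copying keys/values across is sane, something like this is what I had in mind with plyvel -- just a sketch, run against copies of the stores rather than the live directories, and the paths below are made-up examples:

```
import plyvel

# Paths are examples only -- point these at COPIES of the mon stores,
# never at the live /var/lib/ceph/mon/.../store.db directories.
src = plyvel.DB('/root/backup/kh08-8/store.db', create_if_missing=False)
dst = plyvel.DB('/root/rebuild/store.db', create_if_missing=True)

copied = 0
with dst.write_batch() as batch:
    for key, value in src.iterator():
        batch.put(key, value)   # raw bytes in, raw bytes out
        copied += 1

print('copied %d key/value pairs' % copied)
src.close()
dst.close()
```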

On Fri, Aug 12, 2016 at 12:45 PM, Sean Sullivan <seapasulli@xxxxxxxxxxxx> wrote:
ceph-monstore-tool? Is that the same as monmaptool? Oops, never mind -- found it in the ceph-test package::

I can't seem to get it working, though :-( -- dump monmap or any of the other commands, they all bomb out with the same message:

root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# ceph-monstore-tool /var/lib/ceph/mon/ceph-kh10-8 dump-trace -- /tmp/test.trace
Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/store.db/10882319.ldb
root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# ceph-monstore-tool /var/lib/ceph/mon/ceph-kh10-8 dump-keys
Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/store.db/10882319.ldb


I need to clarify: I originally had 2 clusters with this issue; now I have 1 with all 3 monitors dead and 1 that I was able to repair successfully. I am about to recap everything I know about the issue at hand. Should I start a new email thread about this instead?

The cluster that is currently having issues is on hammer (0.94.7), and the monitor specs are the same::
root@kh08-8:~# cat /proc/cpuinfo | grep -iE "model name" | uniq -c
     24 model name : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
     an ext4 volume of 4x 300GB 10k drives in RAID 10
     Ubuntu 14.04

root@kh08-8:~# uname -a
Linux kh08-8 3.13.0-76-generic #120-Ubuntu SMP Mon Jan 18 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
root@kh08-8:~# ceph --version
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)


From here, these are the errors I am getting when starting each of the monitors::


---------------
root@kh08-8:~# /usr/bin/ceph-mon --cluster=ceph -i kh08-8 -d
2016-08-11 22:15:23.731550 7fe5ad3e98c0  0 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 317309
Corruption: error in middle of record
2016-08-11 22:15:28.274340 7fe5ad3e98c0 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-kh08-8': (22) Invalid argument
--
root@kh09-8:~# /usr/bin/ceph-mon --cluster=ceph -i kh09-8 -d
2016-08-11 22:14:28.252370 7f7eaab908c0  0 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 308888
Corruption: 14 missing files; e.g.: /var/lib/ceph/mon/ceph-kh09-8/store.db/10845998.ldb
2016-08-11 22:14:35.094237 7f7eaab908c0 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-kh09-8': (22) Invalid argument
--
root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# /usr/bin/ceph-mon --cluster=ceph -i kh10-8 -d
2016-08-11 22:17:54.632762 7f80bf34d8c0  0 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 292620
Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/store.db/10882319.ldb
2016-08-11 22:18:01.207749 7f80bf34d8c0 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-kh10-8': (22) Invalid argument
---------------


For kh08-8, a coworker patched leveldb to print and skip on the first error, and that store is also missing a bunch of files. As such I think kh10-8 is my most likely candidate to recover, but either way recovery is probably not an option. I see leveldb has a repair.cc (https://github.com/google/leveldb/blob/master/db/repair.cc) but I do not see repair mentioned anywhere in the monitor code with respect to its dbstore. I tried using the leveldb python module (plyvel) to attempt a repair, but my REPL just ends up dying.
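
For reference, this is roughly what I was attempting when the REPL died -- plyvel does expose leveldb's RepairDB. A sketch, run against a copy of the store (the path is a made-up example):

```
import plyvel

# Point this at a COPY of the mon store, not the live directory.
store_copy = '/root/backup/kh10-8/store.db'

# plyvel's repair_db() wraps leveldb's RepairDB (the repair.cc linked above).
plyvel.repair_db(store_copy)

# If the repair finishes, count how many keys are still readable.
db = plyvel.DB(store_copy, create_if_missing=False)
readable = sum(1 for _ in db.iterator(include_value=False))
print('%d keys readable after repair' % readable)
db.close()
```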

I understand two things::
1.) Without rebuilding the monitor backend leveldb store (the cluster map, as I understand it), all of the data in the cluster is essentially lost (right?)
2.) It is possible to rebuild this database via some form of magic or (source)ry, as all of this data is essentially held throughout the cluster as well.

We only use radosgw / S3 on this cluster. If there is a way to recover my data that is easier / more likely to succeed than rebuilding a monitor's leveldb and starting up a single-monitor cluster, I would like to switch gears and focus on that.

Looking at the dev docs, the cluster map has 5 main parts::

```
The Monitor Map: Contains the cluster fsid, the position, name address and port of each monitor. It also indicates the current epoch, when the map was created, and the last time it changed. To view a monitor map, execute ceph mon dump.
The OSD Map: Contains the cluster fsid, when the map was created and last modified, a list of pools, replica sizes, PG numbers, a list of OSDs and their status (e.g., up, in). To view an OSD map, execute ceph osd dump.
The PG Map: Contains the PG version, its time stamp, the last OSD map epoch, the full ratios, and details on each placement group such as the PG ID, the Up Set, the Acting Set, the state of the PG (e.g., active + clean), and data usage statistics for each pool.
The CRUSH Map: Contains a list of storage devices, the failure domain hierarchy (e.g., device, host, rack, row, room, etc.), and rules for traversing the hierarchy when storing data. To view a CRUSH map, execute ceph osd getcrushmap -o {filename}; then, decompile it by executing crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}. You can view the decompiled map in a text editor or with cat.
The MDS Map: Contains the current MDS map epoch, when the map was created, and the last time it changed. It also contains the pool for storing metadata, a list of metadata servers, and which metadata servers are up and in. To view an MDS map, execute ceph mds dump.
```

As we don't use CephFS, the MDS map can essentially be blank (right?), so I am left with 4 valid maps needed to get a working cluster again. I don't see auth mentioned in there, but I will need that too. Then I just need to rebuild the leveldb database somehow with the right information and I should be good. So a long, long, long journey ahead.

I don't think the data is stored as strings or JSON, right? Am I going down the wrong path here? Is there a shorter/simpler path to retrieve the data from a cluster that lost all 3 monitors in a power failure? If I am going down the right path, is there any advice on how I can assemble/repair the database?
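
As a first step, I was planning to just take an inventory of which keys survive in the copy of kh10-8's store, without assuming anything about how the values are encoded -- grouping the raw key bytes by a rough leading prefix to see which families (osdmap, monmap, auth, pgmap, ...) are actually present. Again a plyvel sketch against a copy, with a made-up path:

```
import plyvel
from collections import Counter

# A COPY of the store, not the live directory -- path is an example.
db = plyvel.DB('/root/backup/kh10-8/store.db', create_if_missing=False)

prefixes = Counter()
for key in db.iterator(include_value=False):
    # Keys are raw bytes; group by the first few bytes just to get a
    # rough picture of which key families are in the store.
    prefixes[key[:16]] += 1

for prefix, count in prefixes.most_common(20):
    print(count, repr(prefix))

db.close()
```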

I see that there is an RBD tool for recovering images from a dead cluster. Is it possible to do the same with S3 objects?

On Thu, Aug 11, 2016 at 11:15 AM, Wido den Hollander <wido@xxxxxxxx> wrote:

> Op 11 augustus 2016 om 15:17 schreef Sean Sullivan <seapasulli@xxxxxxxxxxxx>:
>
>
> Hello Wido,
>
> Thanks for the advice.  While the data center has a/b circuits and
> redundant power, etc if a ground fault happens it  travels outside and
> fails causing the whole building to fail (apparently).
>
> The monitors are each the same with
> 2x e5 cpus
> 64gb of ram
> 4x 300gb 10k SAS drives in raid 10 (write through mode).
> Ubuntu 14.04 with the latest updates prior to power failure (2016/Aug/10 -
> 3am CST)
> Ceph hammer LTS 0.94.7
>
> (we are still working on our jewel test cluster so it is planned but not in
> place yet)
>
> The only thing that seems to be corrupt is the monitors leveldb store.  I
> see multiple issues on Google leveldb github from March 2016 about fsync
> and power failure so I assume this is an issue with leveldb.
>
> I have backed up /var/lib/ceph/Mon on all of my monitors before trying to
> proceed with any form of recovery.
>
> Is there any way to reconstruct the leveldb or replace the monitors and
> recover the data?
>
I don't know. I have never done it. Other people might know this better than me.

Maybe 'ceph-monstore-tool' can help you?

Wido

> I found the following post in which sage says it is tedious but possible. (
> http://www.spinics.net/lists/ceph-devel/msg06662.html). Tedious is fine if
> I have any chance of doing it.  I have the fsid, the Mon key map and all of
> the osds look to be fine so all of the previous osd maps  are there.
>
> I just don't understand what key/values I need inside.
>
> On Aug 11, 2016 1:33 AM, "Wido den Hollander" <wido@xxxxxxxx> wrote:
>
> >
> > > Op 11 augustus 2016 om 0:10 schreef Sean Sullivan <
> > seapasulli@xxxxxxxxxxxx>:
> > >
> > >
> > > I think it just got worse::
> > >
> > > all three monitors on my other cluster say that ceph-mon can't open
> > > /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose
> > all
> > > 3 monitors? I saw a post by Sage saying that the data can be recovered as
> > > all of the data is held on other servers. Is this possible? If so has
> > > anyone had any experience doing so?
> >
> > I have never done so, so I couldn't tell you.
> >
> > However, it is weird that on all three it got corrupted. What hardware are
> > you using? Was it properly protected against power failure?
> >
> > If your mon store is corrupted I'm not sure what might happen.
> >
> > However, make a backup of ALL monitors right now before doing anything.
> >
> > Wido
> >
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >



--
- Sean:  I wrote this. - 



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
