I have a Hammer cluster (0.94.9) that died a while ago, consisting of 3 monitors and 630 OSDs spread across 21 storage hosts. The cluster's monitors all died due to leveldb corruption and the cluster was shut down. I was finally given word that I could try to revive the cluster this week!
https://github.com/ceph/ceph/blob/hammer/doc/rados/troubleshooting/troubleshooting-mon.rst#recovery-using-osds
I see that the latest hammer code on GitHub has the ceph-monstore-tool rebuild backport, and that is what I am running on the cluster now (ceph version 0.94.9-4530-g83af8cd (83af8cdaaa6d94404e6146b68e532a784e3cc99c)). I was able to scrape all 630 of the OSDs and am left with a 1.1G store.db directory. Using Python I was able to list all of the keys and values successfully (rough sketch below), which was very promising. That said, I cannot run the final command in the recovery-using-osds article (ceph-monstore-tool rebuild) successfully.
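For reference, this is roughly how I listed the keys, just a minimal sketch: it assumes the plyvel LevelDB binding and a made-up path to the scraped store (the standalone "leveldb" module works the same way via RangeIter()):

    # minimal sketch -- assumes the plyvel binding; path below is an example
    import plyvel

    db = plyvel.DB('/root/mon-store/store.db')   # wherever your scraped store.db lives
    for key, value in db.iterator():
        # print each key and the size of its value
        print(key, len(value))
    db.close()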
Whenever I run the tool (with the newly created admin keyring or with my existing one), it errors with the following:
- 0> 2017-02-17 15:00:47.516901 7f8b4d7408c0 -1 ./mon/MonitorDBStore.h: In function 'KeyValueDB::Iterator MonitorDBStore::get_iterator(const string&)' thread 7f8b4d7408c0 time 2017-02-07 15:00:47.516319
The complete trace is here:
http://pastebin.com/NQE8uYiG
Can anyone lend a hand and tell me what may be wrong? I am able to iterate over the leveldb database in Python, so the structure should be at least somewhat intact, right? Am I SOL at this point? The cluster is no longer in production, and while I don't have months of time, I would really like to recover it just to see if it is at all possible.
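In case it helps with diagnosing, this is roughly how I'm checking which prefixes the rebuilt store contains. Again just a sketch: it assumes plyvel and my reading that MonitorDBStore joins the prefix and key name with a NUL byte, so the split below may be wrong:

    # sketch: count keys per MonitorDBStore prefix (osdmap, paxos, auth, ...)
    # assumption: keys are stored as <prefix>\0<name>
    from collections import Counter
    import plyvel

    db = plyvel.DB('/root/mon-store/store.db')   # same scraped store.db as above
    counts = Counter()
    for key, _ in db.iterator():
        prefix, _, _ = key.partition(b'\x00')
        counts[prefix] += 1
    for prefix, n in counts.most_common():
        print(prefix.decode('utf-8', 'replace'), n)
    db.close()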
- Sean