OSD startup failure

abwalters@xxxxxxxxxxxx (Adam Walters) · Sun, 1 Jun 2014 02:02:21 -0400

I'm running into a problem starting ceph-osd. Admittedly, it was caused by
me (I do believe I know exactly what caused the problem), but I have no
idea how (or even if) it can be fixed. The problem I am running into is
that on start, ceph-osd throws the error and stack trace below:

2014-06-01 01:47:05.548685 7f831fe807a0 -1 osd/PG.cc: In function 'static
epoch_t PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
ceph::bufferlist*)' thread 7f831fe807a0 time 2014-06-01 01:47:05.548055
osd/PG.cc: 2576: FAILED assert(values.size() == 1)

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
ceph::buffer::list*)+0x4ea) [0x7535fa]
 2: (OSD::load_pgs()+0x18f1) [0x568b71]
 3: (OSD::init()+0x22b0) [0x5814e0]
 4: (main()+0x3597) [0x525747]
 5: (__libc_start_main()+0xfd) [0x7f831db06d1d]
 6: /usr/bin/ceph-osd() [0x521df9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.

Now for how I think it was caused, which I hope might help resolve it. This
system was originally running Archlinux, but because I don't update often
enough, updates kept causing difficulties, so I decided to switch to
CentOS. That part isn't the important piece, though. Archlinux used leveldb
1.14, while CentOS came with leveldb-1.7. After getting CentOS installed, I
tried starting ceph with 'service ceph -a start', which prompted me for a
ton of ssh passwords (I fixed that annoyance). Each daemon displayed some
log messages about failure to bind to address, which I ignored: I didn't
know everything the init script did, so I didn't want to interrupt it.
Sadly, this seems to have corrupted all of ceph's leveldb databases.
For the mon and mds daemons, I had tarball backups I took before killing
Archlinux. The osds, however, are larger than any storage medium in my
possession.

Once I figured out that leveldb was too old, I was able to compile
leveldb-1.14 from source and install it on my system. After restoring the
mds and mon data directories from backup (keep in mind that no ceph daemon
successfully started after CentOS was installed), those daemons are all
able to start. It is just the osds that fail now. Hindsight being 20-20, I
really wish I had backed up the omap directories (which look to be where
all of the leveldb stuff is held on the osds) before trying to start them
up.
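
For anyone in a similar position, here is a minimal sketch of the backup I
should have taken first. It assumes each osd keeps its leveldb data under
<osd data dir>/current/omap, which is what I see on my osds; the function
name, the archive naming, and the example paths are made up for
illustration, and the osds must be stopped while it runs.

```python
import tarfile
from pathlib import Path


def backup_omap_dirs(osd_root, dest_dir):
    """Archive every <osd>/current/omap directory under osd_root.

    One gzip-compressed tarball is written per osd into dest_dir.
    Returns the list of archive paths that were written.
    """
    osd_root = Path(osd_root)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    archives = []
    # Assumed layout: <osd_root>/<osd name>/current/omap
    for omap in sorted(osd_root.glob("*/current/omap")):
        if not omap.is_dir():
            continue
        osd_name = omap.parent.parent.name  # e.g. "ceph-0"
        archive = dest_dir / f"{osd_name}-omap.tar.gz"
        with tarfile.open(str(archive), "w:gz") as tar:
            # Keep the osd name in the stored path so it is obvious
            # which osd each archive belongs to when restoring.
            tar.add(str(omap), arcname=f"{osd_name}/current/omap")
        archives.append(archive)
    return archives
```

Something like backup_omap_dirs("/var/lib/ceph/osd", "/root/omap-backup")
would be the call on a default install, though both paths are guesses and
need adjusting. The omap directories are tiny compared to the object data,
so unlike the osds themselves they fit on pretty much any spare disk.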

If there isn't any reliable method to bring the ceph osds back to life,
would there be a method to recover the file data from the osd data
directories so that I could rebuild them without losing everything? It
wouldn't be the end of the world to lose my data, as most of the important
documents and pictures are backed up elsewhere, but I really don't want to
rip all my music and movies from disc again. Not to mention that it would
be slightly painful to lose my VM images.

~Adam Walters