Good evening,

we also tried to rescue data *from* our old/broken pool by mapping the
rbd devices, mounting them on a host and rsyncing away as much as
possible. However, after some time rsync got completely stuck, and
eventually the host that had mounted the mapped rbd devices kernel
panicked, at which point we decided to drop the pool and go with a
backup.

This story and Christian's make me wonder: is anyone using Ceph as a
backend for qemu VM images in production? And: has anyone on the list
been able to recover from a pg incomplete / stuck situation like ours?

Reading about the issues on the list here gives me the impression that
Ceph as a software is stuck/incomplete and has not yet become "clean"
enough for production (sorry for the pun).

Cheers,

Nico

Christian Eichelmann [Tue, Dec 30, 2014 at 12:17:23PM +0100]:
> Hi Nico and all others who answered,
> 
> After some more attempts to somehow get the pgs into a working state
> (I tried force_create_pg, which put them into the creating state; but
> that was obviously not real, since after rebooting one of the
> containing OSDs they went back to incomplete), I decided to save what
> can be saved.
> 
> I created a new pool, created a new image there, and mapped the old
> image from the old pool and the new image from the new pool to one
> machine, in order to copy the data at the POSIX level.
> 
> Unfortunately, formatting the image from the new pool hangs after some
> time. So it seems that the new pool is suffering from the same problem
> as the old pool, which is completely incomprehensible to me.
> 
> Right now, it seems like Ceph is giving me no option to either save
> some of the still intact rbd volumes, or to create a new pool alongside
> the old one to at least enable our clients to send data to Ceph again.
> 
> To tell the truth, I guess that will mean the end of our Ceph project
> (which has been running for 9 months already).
> 
> Regards,
> Christian
> 
> On 29.12.2014 15:59, Nico Schottelius wrote:
> > Hey Christian,
> > 
> > Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]:
> >> [incomplete PG / RBD hanging, osd lost also not helping]
> > 
> > that is very interesting to hear, because we had a similar situation
> > with ceph 0.80.7 and had to re-create a pool after I deleted 3 pg
> > directories to allow the OSDs to start once the disk had filled up
> > completely.
> > 
> > So I am sorry not to be able to give you a good hint, but I am very
> > interested in seeing your problem solved, as it is a show stopper for
> > us, too. (*)
> > 
> > Cheers,
> > 
> > Nico
> > 
> > (*) We migrated from sheepdog to gluster to ceph, and so far sheepdog
> > seems to run much more smoothly. The first one is, however, not
> > supported by OpenNebula directly, and the second one is not flexible
> > enough to host our heterogeneous infrastructure (mixed disk
> > sizes/amounts) - so we are using ceph at the moment.
> > 
> 
> -- 
> Christian Eichelmann
> Systemadministrator
> 
> 1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
> Brauerstraße 48 · DE-76135 Karlsruhe
> Telefon: +49 721 91374-8026
> christian.eichelmann@xxxxxxxx
> 
> Amtsgericht Montabaur / HRB 6484
> Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
> Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
> Aufsichtsratsvorsitzender: Michael Scheeren

-- 
New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24
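
PS: in case it is useful to anyone trying the same rescue path, this is
roughly what we ran (a sketch from memory only; the pool, image, device
and target paths are just examples, and the ro,noload mount assumes an
ext4 filesystem inside the image):

    # map the image from the broken pool; the kernel creates /dev/rbdN
    rbd map oldpool/vm-disk-1

    # mount read-only and skip journal replay, to avoid writing to a
    # possibly inconsistent image (noload is an ext4 mount option)
    mkdir -p /mnt/rescue
    mount -o ro,noload /dev/rbd0 /mnt/rescue

    # copy off whatever is still readable; --partial keeps partially
    # transferred files if rsync hangs and has to be killed
    rsync -aHAX --partial /mnt/rescue/ /backup/vm-disk-1/

    # clean up once (or if) rsync finishes
    umount /mnt/rescue
    rbd unmap /dev/rbd0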