Hey Jon,

Sorry nobody's been able to help you so far; I think your emails must have fallen through the cracks. :( I'm going to go through and try to address some of the things that sound like they might still be relevant...

On Tue, Jul 2, 2013 at 5:05 PM, Jon <three18ti@xxxxxxxxx> wrote:
> Now if I could figure out the exact same issue on my other host...

Which issue are you currently seeing on your other host? You mentioned several in your first two emails, and it didn't sound like they were all going on at the same time.

On Thu, Jun 27, 2013 at 9:34 AM, Jon <three18ti@xxxxxxxxx> wrote:
> The last recorded error was an "Unable to open superblock" error, that was
> several weeks ago, but I think that coincides with the initial trouble I
> experienced. I have tested these disks and can confirm that they have not
> failed.

This, in combination with the checksum error below, makes it sound like there's some kind of issue going on with your disks, though. :/ I see you mentioned at least one power outage, though Ceph should generally survive those as it's quite careful about disk commit ordering. What's your underlying FS, and are you mounting it with any options that might reduce safety (the most common one is nobarrier), or using a RAID card that might have similar safety configuration issues?

> Any help is greatly appreciated as I am really stumped.
> I think my biggest frustration is the init scripts not working as described
> in the docs. After I use ceph-deploy, do I need to write a config file?
> Based on my interpretation of the docs and upstart scripts, I don't think
> so; the respective daemons start on boot...

Hmm, what OS are you using? ceph-deploy does not require writing any additional config files (unless you want to for some reason); the modern scripts auto-detect disks of the appropriate type and folders in the appropriate locations, and start things up that way. Your issue with things starting up in the wrong order sounds like some of the init-system ordering trouble we've run into with systemd and other non-upstart, non-sysvinit systems, and we have some recent patches that should fix that up.

On Sun, Jun 9, 2013 at 11:36 AM, Jon <three18ti@xxxxxxxxx> wrote:
> I've tried a number of things in the docs, but something seems amiss, because
> when I try to restart monitors or osds, the init script tells me it's not
> found.
> I've copied my ceph.conf at the end of this e-mail.
>
>>> root@shepard:~# ls /var/lib/ceph/mon/
>>> ceph-shepard
>>> root@shepard:~# /etc/init.d/ceph restart mon.shepard
>>> /etc/init.d/ceph: mon.shepard not found (/etc/ceph/ceph.conf defines ,
>>> /var/lib/ceph defines )
> (I've also tried mon.0 .. mon.3)

Hmm, what are the contents of /var/lib/ceph and its subdirectories?

> My last question is where does ceph-deploy create the configs? I have the
> original files in a directory where I ran ceph-deploy,
> and I know about the /etc/ceph/ceph.conf file, but there seems to be some
> other config that the cluster is pulling from.
> Maybe I'm mistaken, but there are no osds in my ceph.conf.

Right. You can specify daemons in one of a couple of ways:
1) Put them in the ceph.conf.
2) Tag OSD disks appropriately.
3) Put the data directories for those daemons in the standard system locations (/var/lib/ceph/*).
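As a rough illustration of the first method, here's a minimal sketch of what explicit daemon sections look like in a sysvinit-style ceph.conf. The fsid, monitor address, and OSD ID below are placeholders I've made up for the example (only the hostname "shepard" comes from your output), so treat it as a shape to follow rather than something to copy verbatim:

    # /etc/ceph/ceph.conf -- minimal sketch; fsid, IP, and OSD ID are placeholders
    [global]
        fsid = <your cluster fsid>
        mon host = 10.0.0.1          # assumed monitor address

    [mon.shepard]
        host = shepard               # hostname the sysvinit script matches against

    [osd.0]
        host = shepard

With daemons listed like that, the sysvinit script has something to match when you run "/etc/init.d/ceph restart mon.shepard"; when it finds neither daemon sections for the local host nor daemon directories it recognizes under /var/lib/ceph, you get exactly the empty "defines , defines" error you pasted above. That said, as explained next, ceph-deploy deliberately leans on the other two methods instead.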
We are, in general, moving away from having a single monolithic config that lists every daemon, because it's basically a lie: even with a monolithic config, each daemon looks at its own local copy, so if those copies disagree, the things that work based on config contents (a select number of sysvinit commands) will behave differently across those hosts. ceph-deploy chooses to define daemons via the second and third methods in most cases, because it's designed to allow incremental changes and doesn't want to have to handle simultaneous changes to the conf on a remote host. So when you create a cluster it gives you a skeleton config to modify as you see fit, then pushes that out to every node on which you specify a daemon, but it doesn't change that config when you add new daemons.

When you turn on Ceph, the init system looks for existing daemon data stores and starts them up, and the daemons read in the contents of that minimal ceph.conf in order to find the monitor IPs and any other config options you might have specified for their daemon type.

Given your odd disk sizes, I'm thinking they aren't real disks, so you're either losing the partition types that ceph-disk sets, or else they aren't located in quite the right place for the init system to find them. So again, what are the contents of /var/lib/ceph and its subdirectories on one of your working nodes? :)

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com