Actually there are two monitors (my bad in the previous e-mail):
one on the MASTER and one on the CLIENT.
The monitor on the CLIENT is failing with the following:
2014-03-05 13:08:38.821135 7f76ba82b700  1 mon.client1@0(leader).paxos(paxos active c 25603..26314) is_readable now=2014-03-05 13:08:38.821136 lease_expire=2014-03-05 13:08:40.845978 has v0 lc 26314
2014-03-05 13:08:40.599287 7f76bb22c700  0 mon.client1@0(leader).data_health(86) update_stats avail 4% total 51606140 used 46645692 avail 2339008
2014-03-05 13:08:40.599527 7f76bb22c700 -1 mon.client1@0(leader).data_health(86) reached critical levels of available space on data store -- shutdown!
2014-03-05 13:08:40.599530 7f76bb22c700  0 ** Shutdown via Data Health Service **
2014-03-05 13:08:40.599557 7f76b9328700 -1 mon.client1@0(leader) e2 *** Got Signal Interrupt ***
2014-03-05 13:08:40.599568 7f76b9328700  1 mon.client1@0(leader) e2 shutdown
2014-03-05 13:08:40.599602 7f76b9328700  0 quorum service shutdown
2014-03-05 13:08:40.599609 7f76b9328700  0 mon.client1@0(shutdown).health(86) HealthMonitor::service_shutdown 1 services
2014-03-05 13:08:40.599613 7f76b9328700  0 quorum service shutdown
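As a sanity check, the "avail 4%" in the data_health line above is just the avail/total ratio of the KB figures in that same line (which also match the df output further down). A quick recomputation, with the numbers copied from the log:

```python
# Recompute the monitor's "avail %" from the KB values in the
# update_stats log line above.
total_kb = 51606140   # total
used_kb = 46645692    # used
avail_kb = 2339008    # avail

avail_pct = avail_kb * 100 // total_kb
print(avail_pct)  # 4
```

So the monitor really does see only 4% free on the filesystem holding its data store.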
The thing is that there is plenty of space on that host (CLIENT):
# df -h
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/vg_one-lv_root   50G   45G  2.3G  96% /
tmpfs                       5.9G     0  5.9G   0% /dev/shm
/dev/sda1                   485M   76M  384M  17% /boot
/dev/mapper/vg_one-lv_home  862G  249G  569G  31% /home
On the other hand, the other host (MASTER) is also running low on disk
space (93% full).
But why is the CLIENT failing while the MASTER is still running, even
though it is running low on disk space too?
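Presumably the shutdown is the mon's data-health check comparing that 4% free against its critical threshold; assuming the standard `mon data avail warn` / `mon data avail crit` options (crit defaults to 5%), a minimal ceph.conf sketch for adjusting them would look like this (the values shown are hypothetical, not a recommendation):

```ini
[mon]
# Hypothetical overrides; mon data avail crit defaults to 5 (percent).
# The monitor shuts itself down when the filesystem holding its data
# store (typically /var/lib/ceph/mon) drops below the crit threshold.
mon data avail warn = 10
mon data avail crit = 5
```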
I'll try to free some space and see what happens next...
Best,
G.
On Wed, 05 Mar 2014 11:50:57 +0100, Wido den Hollander wrote:
On 03/05/2014 11:21 AM, Georgios Dimitrakakis wrote:
My setup consists of two nodes.
The first node (master) is running:
-mds
-mon
-osd.0
and the second node (CLIENT) is running:
-osd.1
Therefore I've restarted the ceph services on both nodes.
Leaving "ceph -w" running for as long as it can, after a few seconds
the error that is produced is this:
2014-03-05 12:08:17.715699 7fba13fff700  0 monclient: hunting for new mon
2014-03-05 12:08:17.716108 7fba102f8700  0 -- 192.168.0.10:0/1008298 >> X.Y.Z.X:6789/0 pipe(0x7fba08008e50 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7fba080090b0).fault
(where X.Y.Z.X is the public IP of the CLIENT node).
And it keeps going on...
"ceph health" after a few minutes shows the following:
2014-03-05 12:12:58.355677 7effc52fb700  0 monclient(hunting): authenticate timed out after 300
2014-03-05 12:12:58.355717 7effc52fb700  0 librados: client.admin authentication error (110) Connection timed out
Error connecting to cluster: TimedOut
Any ideas now??
Is the monitor actually running on the first node? If not, check
the logs in /var/log/ceph as to why it isn't running.
Or maybe you just need to start it.
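A quick way to check is something like this (a sketch; the process name and log path are the usual defaults):

```shell
# Look for a running ceph-mon process; if absent, the reason is
# usually in the monitor's log under /var/log/ceph/.
pgrep -a ceph-mon || echo "ceph-mon not running"
```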
Wido
Best,
G.
On Wed, 5 Mar 2014 15:10:25 +0530, Srinivasa Rao Ragolu wrote:
First, try to start the OSDs by restarting the ceph service on the ceph
nodes. If it works fine, you should be able to see the ceph-osd process
running in the process list. You do not need to add any public or
private network in ceph.conf. If none of the OSDs run, then you need to
reconfigure them from the monitor node.
Please check whether the ceph-mon process is running on the monitor
node or not (ceph-mds should not run).
Also check that the /etc/hosts file has valid IP addresses for the
cluster nodes.
Finally, check that ceph.client.admin.keyring and
ceph.bootstrap-osd.keyring match on all the cluster nodes.
Best of luck.
Srinivas.
On Wed, Mar 5, 2014 at 3:04 PM, Georgios Dimitrakakis wrote:
Hi!
I have installed ceph and created two OSDs, and was very happy with
that, but apparently not everything was correct.
Today, after a system reboot, the cluster comes up and for a few
moments it seems that it's ok (using the "ceph health" command), but
after a few seconds the "ceph health" command doesn't produce any
output at all.
It just stays there without anything on the screen...
ceph -w is doing the same as well...
If I restart the ceph services ("service ceph restart"), it works
again for a few seconds, but after a few more it stays frozen.
Initially I thought that this was a firewall problem, but apparently
it isn't.
Then I thought that this had to do with public_network and
cluster_network not being defined in ceph.conf, and changed that.
No matter what I do, the cluster works for a few seconds after the
service restart and then it stops responding...
Any help much appreciated!!!
Best,
G.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com