Re: All OSDs don't restart after shutdown


 



On 11/06/2014 12:36 PM, Antonio Messina wrote:
> On Thu, Nov 6, 2014 at 12:00 PM, Luca Mazzaferro
> <luca.mazzaferro@xxxxxxxxxx> wrote:
>> Dear Users,
> Hi Luca,
>
>> On the admin-node side the ceph health command or the ceph -w hangs forever.
> I'm not a ceph expert either, but this is usually an indication that
> the monitors are not running.
>
> How many MONs are you running? Are they all alive? What's in the mon
> logs? Also check the time on the mon nodes.
>
> cheers,
> Antonio

Hi Antonio,
thank you very much for your answer.

I'm running 3 MONs and they are all alive.

The logs don't show any problem that I can recognize.
This is a section of the log from the "initial monitor", after a restart:

2014-11-06 14:31:36.795298 7fb66e4867a0 0 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3), process ceph-mon, pid 28050
2014-11-06 14:31:36.860884 7fb66e4867a0 0 starting mon.ceph-node1 rank 0 at 192.168.122.21:6789/0 mon_data /var/lib/ceph/mon/ceph-ceph-node1 fsid 62e03428-0c4a-4ede-be18-c2cfed10639d
2014-11-06 14:31:36.861383 7fb66e4867a0 1 mon.ceph-node1@-1(probing) e3 preinit fsid 62e03428-0c4a-4ede-be18-c2cfed10639d
2014-11-06 14:31:36.862614 7fb66e4867a0 1 mon.ceph-node1@-1(probing).paxosservice(pgmap 1..218) refresh upgraded, format 0 -> 1
2014-11-06 14:31:36.862666 7fb66e4867a0 1 mon.ceph-node1@-1(probing).pg v0 on_upgrade discarding in-core PGMap
2014-11-06 14:31:36.866958 7fb66e4867a0 0 mon.ceph-node1@-1(probing).mds e4 print_map
epoch    4
flags    0
created    2014-11-04 12:30:56.224692
modified    2014-11-05 13:00:53.377356
tableserver    0
root    0
session_timeout    60
session_autoclose    300
max_file_size    1099511627776
last_failure    0
last_failure_osd_epoch    0
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap}
max_mds    1
in    0
up    {0=4243}
failed
stopped
data_pools    0
metadata_pool    1
inline_data    disabled
4243:    192.168.122.21:6805/28039 'ceph-node1' mds.0.1 up:active seq 2

2014-11-06 14:31:36.867144 7fb66e4867a0 0 mon.ceph-node1@-1(probing).osd e15 crush map has features 1107558400, adjusting msgr requires
2014-11-06 14:31:36.867155 7fb66e4867a0 0 mon.ceph-node1@-1(probing).osd e15 crush map has features 1107558400, adjusting msgr requires
2014-11-06 14:31:36.867157 7fb66e4867a0 0 mon.ceph-node1@-1(probing).osd e15 crush map has features 1107558400, adjusting msgr requires
2014-11-06 14:31:36.867159 7fb66e4867a0 0 mon.ceph-node1@-1(probing).osd e15 crush map has features 1107558400, adjusting msgr requires
2014-11-06 14:31:36.867850 7fb66e4867a0 1 mon.ceph-node1@-1(probing).paxosservice(auth 1..37) refresh upgraded, format 0 -> 1
2014-11-06 14:31:36.868898 7fb66e4867a0 0 mon.ceph-node1@-1(probing) e3 my rank is now 0 (was -1)
2014-11-06 14:31:36.869655 7fb666410700 0 -- 192.168.122.21:6789/0 >> 192.168.122.22:6789/0 pipe(0x2b18a00 sd=22 :0 s=1 pgs=0 cs=0 l=0 c=0x2950c60).fault
2014-11-06 14:31:36.869817 7fb66630f700 0 -- 192.168.122.21:6789/0 >> 192.168.122.23:6789/0 pipe(0x2b19680 sd=21 :0 s=1 pgs=0 cs=0 l=0 c=0x29518c0).fault
2014-11-06 14:31:52.224266 7fb66580d700 0 -- 192.168.122.21:6789/0 >> 192.168.122.22:6789/0 pipe(0x2b1be80 sd=23 :6789 s=0 pgs=0 cs=0 l=0 c=0x2951b80).accept connect_seq 0 vs existing 0 state connecting
2014-11-06 14:31:57.987230 7fb66570c700 0 -- 192.168.122.21:6789/0 >> 192.168.122.23:6789/0 pipe(0x2b1d280 sd=24 :6789 s=0 pgs=0 cs=0 l=0 c=0x2951ce0).accept connect_seq 0 vs existing 0 state connecting
2014-11-06 14:32:36.868421 7fb668213700 0 mon.ceph-node1@0(probing).data_health(0) update_stats avail 20% total 8563152 used 6364364 avail 1763796
2014-11-06 14:32:36.868739 7fb668213700 0 log [WRN] : reached concerning levels of available space on local monitor storage (20% free)
2014-11-06 14:33:36.869029 7fb668213700 0 mon.ceph-node1@0(probing).data_health(0) update_stats avail 20% total 8563152 used 6364364 avail 1763796
2014-11-06 14:34:36.869285 7fb668213700 0 mon.ceph-node1@0(probing).data_health(0) update_stats avail 20% total 8563152 used 6364364 avail 1763796
2014-11-06 14:35:36.869588 7fb668213700 0 mon.ceph-node1@0(probing).data_health(0) update_stats avail 20% total 8563152 used 6364364 avail 1763796
2014-11-06 14:36:36.869910 7fb668213700 0 mon.ceph-node1@0(probing).data_health(0) update_stats avail 20% total 8563152 used 6364364 avail 1763796
2014-11-06 14:37:36.870395 7fb668213700 0 mon.ceph-node1@0(probing).data_health(0) update_stats avail 20% total 8563152 used 6364364 avail 1763796
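
For what it's worth, mon.ceph-node1 never seems to leave the probing state, and the pipe(...).fault lines above are towards the other two monitors. The monitor state can also be queried locally through its admin socket, which bypasses the hanging ceph client; the socket path below is just the default layout, so an assumption on my side:

  # on ceph-node1, against the local monitor's admin socket
  ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node1.asok mon_status
  ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node1.asok quorum_status

If those also report "probing" and no quorum, the problem is between the monitors rather than between the client and the monitors.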


Instead, on my admin node, after waiting about 5 minutes, I got this:
[rzgceph@admin-node my-cluster]$ ceph -s
2014-11-06 12:18:43.723751 7f3f5d645700 0 monclient(hunting): authenticate timed out after 300
2014-11-06 12:18:43.723848 7f3f5d645700 0 librados: client.admin authentication error (110) Connection timed out
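
Since the monitor log also shows the monitors faulting when connecting to each other on port 6789, one thing I still want to rule out is a firewall between the nodes. A rough check (just a sketch, not something I have already run) would be:

  # on each mon node: is the monitor listening on 6789, and is anything dropping traffic?
  ss -tlnp | grep 6789          # or: netstat -tlnp | grep 6789
  iptables -S | grep -i -e DROP -e REJECT
  # from the admin node: can I reach each monitor on that port?
  telnet 192.168.122.21 6789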

That led me to this discussion:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-January/036922.html

which unfortunately was never resolved.
I also checked the permissions on the keys, but they seem to be OK.
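
The kind of check I mean (assuming the default keyring location from the quick-start layout) is:

  ls -l /etc/ceph/ceph.client.admin.keyring
  ceph-authtool -l /etc/ceph/ceph.client.admin.keyring
  # the client.admin key listed here should match the one the monitors know about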

As for the time: it is synchronized via ntpd and working correctly.
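
The check I mean there is just ntpq on each mon node, for example:

  ntpq -p    # check that the offsets on the three mon nodes stay within a few milliseconds

If I read the docs correctly, the monitors only warn about clock skew beyond mon_clock_drift_allowed (0.05 s by default), so ntp-level sync should be more than enough.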

Do I need to clean everything and start again from scratch?
Thank you.
Cheers.

    Luca

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



