Re: All OSDs don't restart after shutdown


 



On 11/06/2014 12:36 PM, Antonio Messina wrote:
> On Thu, Nov 6, 2014 at 12:00 PM, Luca Mazzaferro
> <luca.mazzaferro@xxxxxxxxxx> wrote:
>> Dear Users,
> Hi Luca,
>
>> On the admin-node side the ceph health command or the ceph -w hangs forever.
> I'm not a ceph expert either, but this is usually an indication that
> the monitors are not running.
>
> How many MONs are you running? Are they all alive? What's in the mon
> logs? Also check the time on the mon nodes.
>
> cheers,
> Antonio

Hi Antonio,
thank you very much for your answer.

I'm running 3 MONs and they are all alive.

The logs don't show any problem that I can recognize.
This is a section of the log from the "initial monitor", after a restart:

2014-11-06 14:31:36.795298 7fb66e4867a0 0 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3), process ceph-mon, pid 28050
2014-11-06 14:31:36.860884 7fb66e4867a0 0 starting mon.ceph-node1 rank 0 at 192.168.122.21:6789/0 mon_data /var/lib/ceph/mon/ceph-ceph-node1 fsid 62e03428-0c4a-4ede-be18-c2cfed10639d
2014-11-06 14:31:36.861383 7fb66e4867a0 1 mon.ceph-node1@-1(probing) e3 preinit fsid 62e03428-0c4a-4ede-be18-c2cfed10639d
2014-11-06 14:31:36.862614 7fb66e4867a0 1 mon.ceph-node1@-1(probing).paxosservice(pgmap 1..218) refresh upgraded, format 0 -> 1
2014-11-06 14:31:36.862666 7fb66e4867a0 1 mon.ceph-node1@-1(probing).pg v0 on_upgrade discarding in-core PGMap
2014-11-06 14:31:36.866958 7fb66e4867a0 0 mon.ceph-node1@-1(probing).mds e4 print_map
epoch    4
flags    0
created    2014-11-04 12:30:56.224692
modified    2014-11-05 13:00:53.377356
tableserver    0
root    0
session_timeout    60
session_autoclose    300
max_file_size    1099511627776
last_failure    0
last_failure_osd_epoch    0
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap}
max_mds    1
in    0
up    {0=4243}
failed
stopped
data_pools    0
metadata_pool    1
inline_data    disabled
4243:    192.168.122.21:6805/28039 'ceph-node1' mds.0.1 up:active seq 2

2014-11-06 14:31:36.867144 7fb66e4867a0 0 mon.ceph-node1@-1(probing).osd e15 crush map has features 1107558400, adjusting msgr requires
2014-11-06 14:31:36.867155 7fb66e4867a0 0 mon.ceph-node1@-1(probing).osd e15 crush map has features 1107558400, adjusting msgr requires
2014-11-06 14:31:36.867157 7fb66e4867a0 0 mon.ceph-node1@-1(probing).osd e15 crush map has features 1107558400, adjusting msgr requires
2014-11-06 14:31:36.867159 7fb66e4867a0 0 mon.ceph-node1@-1(probing).osd e15 crush map has features 1107558400, adjusting msgr requires
2014-11-06 14:31:36.867850 7fb66e4867a0 1 mon.ceph-node1@-1(probing).paxosservice(auth 1..37) refresh upgraded, format 0 -> 1
2014-11-06 14:31:36.868898 7fb66e4867a0 0 mon.ceph-node1@-1(probing) e3 my rank is now 0 (was -1)
2014-11-06 14:31:36.869655 7fb666410700 0 -- 192.168.122.21:6789/0 >> 192.168.122.22:6789/0 pipe(0x2b18a00 sd=22 :0 s=1 pgs=0 cs=0 l=0 c=0x2950c60).fault
2014-11-06 14:31:36.869817 7fb66630f700 0 -- 192.168.122.21:6789/0 >> 192.168.122.23:6789/0 pipe(0x2b19680 sd=21 :0 s=1 pgs=0 cs=0 l=0 c=0x29518c0).fault
2014-11-06 14:31:52.224266 7fb66580d700 0 -- 192.168.122.21:6789/0 >> 192.168.122.22:6789/0 pipe(0x2b1be80 sd=23 :6789 s=0 pgs=0 cs=0 l=0 c=0x2951b80).accept connect_seq 0 vs existing 0 state connecting
2014-11-06 14:31:57.987230 7fb66570c700 0 -- 192.168.122.21:6789/0 >> 192.168.122.23:6789/0 pipe(0x2b1d280 sd=24 :6789 s=0 pgs=0 cs=0 l=0 c=0x2951ce0).accept connect_seq 0 vs existing 0 state connecting
2014-11-06 14:32:36.868421 7fb668213700 0 mon.ceph-node1@0(probing).data_health(0) update_stats avail 20% total 8563152 used 6364364 avail 1763796
2014-11-06 14:32:36.868739 7fb668213700 0 log [WRN] : reached concerning levels of available space on local monitor storage (20% free)
2014-11-06 14:33:36.869029 7fb668213700 0 mon.ceph-node1@0(probing).data_health(0) update_stats avail 20% total 8563152 used 6364364 avail 1763796
2014-11-06 14:34:36.869285 7fb668213700 0 mon.ceph-node1@0(probing).data_health(0) update_stats avail 20% total 8563152 used 6364364 avail 1763796
2014-11-06 14:35:36.869588 7fb668213700 0 mon.ceph-node1@0(probing).data_health(0) update_stats avail 20% total 8563152 used 6364364 avail 1763796
2014-11-06 14:36:36.869910 7fb668213700 0 mon.ceph-node1@0(probing).data_health(0) update_stats avail 20% total 8563152 used 6364364 avail 1763796
2014-11-06 14:37:36.870395 7fb668213700 0 mon.ceph-node1@0(probing).data_health(0) update_stats avail 20% total 8563152 used 6364364 avail 1763796
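
For what it's worth, mon.ceph-node1 never seems to leave the probing state, and the pipe(...).fault lines above are towards the other two monitors. The monitor state can also be queried locally through its admin socket, which bypasses the hanging ceph client; the socket path below is just the default layout, so an assumption on my side:

  # on ceph-node1, against the local monitor's admin socket
  ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node1.asok mon_status
  ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node1.asok quorum_status

If those also report "probing" and no quorum, the problem is between the monitors rather than between the client and the monitors.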


Instead, on my admin node, after waiting about 5 minutes, I got this:
[rzgceph@admin-node my-cluster]$ ceph -s
2014-11-06 12:18:43.723751 7f3f5d645700 0 monclient(hunting): authenticate timed out after 300
2014-11-06 12:18:43.723848 7f3f5d645700 0 librados: client.admin authentication error (110) Connection timed out
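
Since the monitor log also shows the monitors faulting when connecting to each other on port 6789, one thing I still want to rule out is a firewall between the nodes. A rough check (just a sketch, not something I have already run) would be:

  # on each mon node: is the monitor listening on 6789, and is anything dropping traffic?
  ss -tlnp | grep 6789          # or: netstat -tlnp | grep 6789
  iptables -S | grep -i -e DROP -e REJECT
  # from the admin node: can I reach each monitor on that port?
  telnet 192.168.122.21 6789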

That led me to this discussion:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-January/036922.html

which unfortunately was never resolved.
I also checked the permissions on the keys, but they seem to be OK.
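
The kind of check I mean (assuming the default keyring location from the quick-start layout) is:

  ls -l /etc/ceph/ceph.client.admin.keyring
  ceph-authtool -l /etc/ceph/ceph.client.admin.keyring
  # the client.admin key listed here should match the one the monitors know about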

As for the time: it is synchronized via ntpd and working correctly.
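
The check I mean there is just ntpq on each mon node, for example:

  ntpq -p    # check that the offsets on the three mon nodes stay within a few milliseconds

If I read the docs correctly, the monitors only warn about clock skew beyond mon_clock_drift_allowed (0.05 s by default), so ntp-level sync should be more than enough.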

Do I need to clean everything and start again from scratch?
Thank you.
Cheers.

    Luca

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



