Hello Community,

I need help with my production Ceph cluster, where multiple OSDs are crashing after throwing this error:

2015-08-11 16:01:19.617860 7f3d95219700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
2015-08-11 16:01:19.618929 7f3d95219700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use

This is the second time I have seen this problem in the last 4 days. Earlier I restarted the OSD services and they worked initially, but today the OSDs broke again.

Here is the backtrace:

   -10> 2015-08-10 12:38:02.766359 7faa0abce700 -1 osd.60 39761 heartbeat_check: no reply from osd.33 ever on either front or back, first ping sent 2015-08-10 12:37:00.655566 (cutoff 2015-08-10 12:37:42.766354)
    -9> 2015-08-10 12:38:02.766423 7faa0abce700 -1 osd.60 39761 heartbeat_check: no reply from osd.50 ever on either front or back, first ping sent 2015-08-10 12:37:00.655566 (cutoff 2015-08-10 12:37:42.766354)
    -8> 2015-08-10 12:38:02.766433 7faa0abce700 -1 osd.60 39761 heartbeat_check: no reply from osd.134 ever on either front or back, first ping sent 2015-08-10 12:37:23.469422 (cutoff 2015-08-10 12:37:42.766354)
    -7> 2015-08-10 12:38:02.766446 7faa0abce700 -1 osd.60 39761 heartbeat_check: no reply from osd.200 ever on either front or back, first ping sent 2015-08-10 12:37:15.361731 (cutoff 2015-08-10 12:37:42.766354)
    -6> 2015-08-10 12:38:02.766454 7faa0abce700 -1 osd.60 39761 heartbeat_check: no reply from osd.228 ever on either front or back, first ping sent 2015-08-10 12:37:00.655566 (cutoff 2015-08-10 12:37:42.766354)
    -5> 2015-08-10 12:38:03.259647 7fa9b5b9a700 0 -- 10.100.50.2:0/82807 >> 10.100.50.4:7142/147030592 pipe(0x4ff3200 sd=399 :0 s=1 pgs=0 cs=0 l=1 c=0x44b3de0).fault
    -4> 2015-08-10 12:38:03.259682 7fa9b5594700 0 -- 10.100.50.2:0/82807 >> 10.100.50.1:7204/408026440 pipe(0xf278f00 sd=411 :0 s=1 pgs=0 cs=0 l=1 c=0x44b7bc0).fault
    -3> 2015-08-10 12:38:03.271675 7fa9ecda2700 0 log [WRN] : map e39763 wrongly marked me down
    -2> 2015-08-10 12:38:03.306073 7fa9ecda2700 -1 accepter.accepter.bind unable to bind to 10.100.50.2:7300 on any port in range 6800-7300: (98) Address already in use
    -1> 2015-08-10 12:38:03.368817 7fa9ecda2700 0 osd.60 39763 prepare_to_stop starting shutdown
     0> 2015-08-10 12:38:03.372071 7fa9ecda2700 -1 common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7fa9ecda2700 time 2015-08-10 12:38:03.368886
common/Mutex.cc: 93: FAILED assert(r == 0)

 ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
 1: (Mutex::Lock(bool)+0x1d3) [0xa83003]
 2: (OSD::shutdown()+0x63) [0x63f3f3]
 3: (OSD::handle_osd_map(MOSDMap*)+0x1829) [0x64dff9]
 4: (OSD::_dispatch(Message*)+0x2fb) [0x6600eb]
 5: (OSD::ms_dispatch(Message*)+0x211) [0x6607b1]
 6: (DispatchQueue::entry()+0x5a2) [0xb5ac12]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0xaf23ad]
 8: /lib64/libpthread.so.0() [0x35952079d1]
 9: (clone()+0x6d) [0x3594ee89dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

My environment:

ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
Kernel: 2.6.32-431.el6.x86_64
CentOS release 6.5 (Final)

I have 4 OSD nodes, but only 2 of them have shown this error.

I have reported this at http://tracker.ceph.com/issues/12655

****************************************************************
Karan Singh
Systems Specialist, Storage Platforms
CSC - IT Center for Science,
Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
mobile: +358 503 812758
tel. +358 9 4572001
fax +358 9 4572302
http://www.csc.fi/
****************************************************************
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com