Hi Christian,

thank you very much for your hint! I usually use the search function of the mailing list archive and did not find this thread.

I installed munin on all nodes to get a better overview of what happens where at a specific time.

When the problem occurs, munin does not receive/show any IO values for some HDDs (but not necessarily the HDD that was marked out); the same goes for average latency. The disk utilization graphs show that utilization sometimes peaks at 100% on different HDDs, but even then it is not necessarily that specific HDD that gets marked out. So all in all, your assumption that the HDDs are overloaded seems right.

I am now using

ceph osd set nodown
ceph osd set noout

to prevent the cluster from marking the HDDs down/out. I hope that works. (The exact set/unset commands I use are at the bottom of this mail, below the log excerpt.)

What I also saw in the logs:

2016-03-02 07:49:52.435680 7fd85c89a700 -1 osd/ReplicatedBackend.cc: In function 'void ReplicatedBackend::prepare_pull(eversion_t, const hobject_t&, ObjectContextRef, ReplicatedBackend::RPGHandle*)' thread 7fd85c89a700 time 2016-03-02 07:49:52.273117
osd/ReplicatedBackend.cc: 1482: FAILED assert(get_parent()->get_log().get_log().objects.count(soid) && (get_parent()->get_log().get_log().objects.find(soid)->second->op == pg_log_entry_t::LOST_REVERT) && (get_parent()->get_log().get_log().objects.find(soid)->second->reverting_to == v))

 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc9195]
 2: (ReplicatedBackend::prepare_pull(eversion_t, hobject_t const&, std::tr1::shared_ptr<ObjectContext>, ReplicatedBackend::RPGHandle*)+0xcaa) [0xa0e34a]
 3: (ReplicatedBackend::recover_object(hobject_t const&, eversion_t, std::tr1::shared_ptr<ObjectContext>, std::tr1::shared_ptr<ObjectContext>, PGBackend::RecoveryHandle*)+0x29a) [0xa0f98a]
 4: (ReplicatedPG::recover_missing(hobject_t const&, eversion_t, int, PGBackend::RecoveryHandle*)+0x602) [0x86b4b2]
 5: (ReplicatedPG::wait_for_unreadable_object(hobject_t const&, std::tr1::shared_ptr<OpRequest>)+0x49a) [0x87055a]
 6: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x7fd) [0x89473d]
 7: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x833e1a]
 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed) [0x69586d]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x695d59]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xbb889f]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbba9d0]
 12: (()+0x7dc5) [0x7fd87d87edc5]
 13: (clone()+0x6d) [0x7fd87c36028d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.14.log

What exactly does this mean?

Thank you!
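For completeness, this is roughly how I am handling the flags. The verification with `ceph osd dump` and the iostat check are just my own guesses at how best to watch this while debugging, so please correct me if there is a better way:

  # suppress marking OSDs down/out while debugging the disk overload
  ceph osd set nodown
  ceph osd set noout

  # verify that the flags are active (the flags line should list nodown,noout)
  ceph osd dump | grep flags

  # watch per-disk utilization on a node while (deep) scrubbing runs (sysstat package)
  iostat -x 5

  # clear the flags again once the load problem is solved, otherwise
  # genuinely failed OSDs would never be marked down/out
  ceph osd unset nodown
  ceph osd unset noout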
--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Address:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402, Amtsgericht Hanau
Managing director: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107

On 01.03.2016 at 03:13, Christian Balzer wrote:
>
> Hello,
>
> googling for "ceph wrong node" gives us this insightful thread:
> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg09960.html
>
> I suggest reading through it, more below:
>
> On Mon, 29 Feb 2016 15:30:41 +0100 Oliver Dzombic wrote:
>
>> Hi,
>>
>> i face here some trouble with the cluster.
>>
>> Suddenly "random" OSD's are getting marked out.
>>
>> After restarting the OSD on the specific node, its working again.
>>
> Matches the scenario mentioned above.
>
>> This happens usually during activated scrubbing/deep scrubbing.
>>
> I guess your cluster is very much overloaded on some level, use atop or
> similar tools to find out what needs improvement.
>
> Also, as always, versions of all SW/kernel, a HW description, output of
> "ceph -s" etc. will help people identify possible problem spots or to
> correlate this to other things.
>
> Christian
>
>> In the logs i can see:
>>
>> 2016-02-29 06:08:58.130376 7fd5dae75700 0 -- 10.0.1.2:0/36459 >>
>> 10.0.0.4:6807/9051245 pipe(0x27488000 sd=58 :60473 s=1 pgs=0 cs=0 l=1
>> c=0x28b39440).connect claims to be 10.0.0.4:6807/12051245 not
>> 10.0.0.4:6807/9051245 - wrong node!
>> 2016-02-29 06:08:58.130417 7fd5d9961700 0 -- 10.0.1.2:0/36459 >>
>> 10.0.1.4:6803/6002429 pipe(0x2a6c9000 sd=75 :37736 s=1 pgs=0 cs=0 l=1
>> c=0x2420be40).connect claims to be 10.0.1.4:6803/10002429 not
>> 10.0.1.4:6803/6002429 - wrong node!
>> 2016-02-29 06:08:58.130918 7fd5b1c17700 0 -- 10.0.1.2:0/36459 >>
>> 10.0.0.1:6800/8050402 pipe(0x26834000 sd=74 :37605 s=1 pgs=0 cs=0 l=1
>> c=0x1f7a9020).connect claims to be 10.0.0.1:6800/9050770 not
>> 10.0.0.1:6800/8050402 - wrong node!
>> 2016-02-29 06:08:58.131266 7fd5be141700 0 -- 10.0.1.2:0/36459 >>
>> 10.0.0.3:6806/9059302 pipe(0x27f07000 sd=76 :48347 s=1 pgs=0 cs=0 l=1
>> c=0x2371adc0).connect claims to be 10.0.0.3:6806/11059302 not
>> 10.0.0.3:6806/9059302 - wrong node!
>> 2016-02-29 06:08:58.131299 7fd5c1914700 0 -- 10.0.1.2:0/36459 >>
>> 10.0.1.4:6801/9051245 pipe(0x2d288000 sd=100 :33848 s=1 pgs=0 cs=0 l=1
>> c=0x28b37760).connect claims to be 10.0.1.4:6801/12051245 not
>> 10.0.1.4:6801/9051245 - wrong node!
>>
>> and
>>
>> 2016-02-29 06:08:59.230754 7fd5c5425700 -1 osd.3 14877 heartbeat_check:
>> no reply from osd.0 since back 2016-02-29 05:55:26.351951 front
>> 2016-02-29 05:55:26.351951 (cutoff 2016-02-29 06:08:39.230753)
>> 2016-02-29 06:08:59.230761 7fd5c5425700 -1 osd.3 14877 heartbeat_check:
>> no reply from osd.1 since back 2016-02-29 05:41:59.191341 front
>> 2016-02-29 05:41:59.191341 (cutoff 2016-02-29 06:08:39.230753)
>> 2016-02-29 06:08:59.230765 7fd5c5425700 -1 osd.3 14877 heartbeat_check:
>> no reply from osd.2 since back 2016-02-29 05:41:59.191341 front
>> 2016-02-29 05:41:59.191341 (cutoff 2016-02-29 06:08:39.230753)
>> 2016-02-29 06:08:59.230769 7fd5c5425700 -1 osd.3 14877 heartbeat_check:
>> no reply from osd.4 since back 2016-02-29 05:55:30.452505 front
>> 2016-02-29 05:55:30.452505 (cutoff 2016-02-29 06:08:39.230753)
>> 2016-02-29 06:08:59.230773 7fd5c5425700 -1 osd.3 14877 heartbeat_check:
>> no reply from osd.7 since back 2016-02-29 05:41:52.790422 front
>> 2016-02-29 05:41:52.790422 (cutoff 2016-02-29 06:08:39.230753)
>>
>>
>> Any idea what could be the trouble of the cluster ?
>>
>> Thank you !
>>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com