Thanks Wang, it looks like that is the case, so Ceph is not to blame :)
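For anyone who finds this thread later, here is a minimal sketch of the dmesg check suggested below, written in Python. The pattern list and the sample lines are assumptions for illustration only; they are not taken from the affected machine, and real kernel messages vary by kernel version and driver.

```python
import re
import subprocess

# Substrings that commonly accompany disk read failures in Linux kernel logs.
# This list is an assumption for illustration, not an exhaustive catalogue.
IO_ERROR_PATTERN = re.compile(
    r"I/O error|Medium Error|Unrecovered read error",
    re.IGNORECASE,
)


def find_io_errors(lines):
    """Return the kernel log lines that look like disk I/O errors."""
    return [line for line in lines if IO_ERROR_PATTERN.search(line)]


def read_dmesg():
    """Run `dmesg` and return its output as a list of lines (may need root)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


# Hypothetical sample lines, NOT from the machine discussed in this thread.
sample = [
    "[12345.678901] sd 0:0:3:0: [sdd] Add. Sense: Unrecovered read error",
    "[12345.678950] blk_update_request: I/O error, dev sdd, sector 123456",
    "[12346.000000] eth0: link up",
]
matches = find_io_errors(sample)
```

On a live host, `find_io_errors(read_dmesg())` would apply the same filter to the real kernel log. Any match naming the device behind the crashing OSD points at the disk rather than at Ceph.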
On 25 October 2016 at 09:59, Haomai Wang <haomai@xxxxxxxx> wrote:
Could you check dmesg? I think there is a disk EIO error.

On Tue, Oct 25, 2016 at 9:58 AM, Zhang Qiang <dotslash.lu@xxxxxxxxx> wrote:

Hi,

One of several OSDs on the same machine has crashed several times within a few days. It is always the same one; the other OSDs are all fine. Below is the dumped message. Since it is too long to include in full, I have pasted only the head and tail of the recent events. If it is necessary to inspect the full log, please see https://gist.github.com/dotSlashLu/3e8ca9491fbf07636a4583244ac23f80 .

2016-10-24 18:52:06.216341 7f307c22f700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7f307c22f700 time 2016-10-24 18:52:06.213123
os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)

ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc9195]
2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xc94) [0x909f34]
3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x311) [0x9fe0e1]
4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2e8) [0x8ce8c8]
5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x213) [0x7def53]
6: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4c2) [0x7df722]
7: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xbe) [0x6dcade]
8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbb9966]
9: (ThreadPool::WorkThread::entry()+0x10) [0xbba9f0]
10: (()+0x7dc5) [0x7f309cd26dc5]
11: (clone()+0x6d) [0x7f309b80821d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
-10000> 2016-10-24 18:51:34.341035 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.56:6821/4808 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x175a2c00 con 0x1526a940
-9999> 2016-10-24 18:51:34.341046 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.61:6817/4808 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x175a3600 con 0x15269fa0
-9998> 2016-10-24 18:51:34.341058 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.56:6823/5402 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x12aaa400 con 0x27bc9080
-9997> 2016-10-24 18:51:34.341069 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.61:6821/5402 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x1f89ec00 con 0x27bc91e0
-9996> 2016-10-24 18:51:34.341080 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.56:6824/6216 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0xaa16000 con 0x175b0c00
-9995> 2016-10-24 18:51:34.341090 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.61:6818/6216 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x23b87800 con 0x175ae160
-9994> 2016-10-24 18:51:34.341101 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.57:6802/23367 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x258ed400 con 0x17500d60
-9993> 2016-10-24 18:51:34.341113 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.62:6806/23367 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x242bb000 con 0x175019c0
-9992> 2016-10-24 18:51:34.341128 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.57:6805/25009 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x28e41c00 con 0x1744aec0
-9991> 2016-10-24 18:51:34.341139 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.62:6805/25009 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x10be5200 con 0x175bf8c0
-9990> 2016-10-24 18:51:34.341130 7f3088a48700 1 -- 10.3.149.62:0/25857 <== osd.1 10.3.149.55:6835/2010188 187557 ==== osd_ping(ping_reply e3014 stamp 2016-10-24 18:51:34.340550) v2 ==== 47+0+0 (1550182756 0 0) 0x1a83bc00 con 0x7874580
-9989> 2016-10-24 18:51:34.341151 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.57:6814/26469 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x1f48aa00 con 0x175bfa20
-9988> 2016-10-24 18:51:34.341162 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.62:6811/26469 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x24456e00 con 0x175bfb80
-9987> 2016-10-24 18:51:34.341174 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.58:6805/2023199 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x25c59e00 con 0x7874f20
-9986> 2016-10-24 18:51:34.341186 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.63:6805/2023199 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x19703c00 con 0x7875760
-9985> 2016-10-24 18:51:34.341208 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.58:6803/2023356 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x19702600 con 0x26444940
-9984> 2016-10-24 18:51:34.341231 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.63:6803/2023356 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0xa67da00 con 0x7874c60
-9983> 2016-10-24 18:51:34.341249 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.58:6809/2023604 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x22111000 con 0x17887860
-9982> 2016-10-24 18:51:34.341262 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.63:6811/2023604 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x1fe62200 con 0x17887de0
-9981> 2016-10-24 18:51:34.341281 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.58:6802/2023892 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x1fc32c00 con 0x24246100
-9980> 2016-10-24 18:51:34.341297 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.63:6801/2023892 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x20544c00 con 0x24246d60
...
-20> 2016-10-24 18:52:05.273121 7f3086243700 1 -- 10.3.149.57:6811/25857 --> 10.3.149.60:0/10188 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.212809) v2 -- ?+0 0x27c1a600 con 0x1744aaa0
-19> 2016-10-24 18:52:05.273129 7f3087a46700 1 -- 10.3.149.62:6810/25857 <== osd.1 10.3.149.60:0/10188 187279 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.212809) v2 ==== 47+0+0 (387409057 0 0) 0x1ff4f600 con 0x175b1860
-18> 2016-10-24 18:52:05.273157 7f3087a46700 1 -- 10.3.149.62:6810/25857 --> 10.3.149.60:0/10188 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.212809) v2 -- ?+0 0x10d73a00 con 0x175b1860
-17> 2016-10-24 18:52:05.641202 7f3086243700 1 -- 10.3.149.57:6811/25857 <== osd.29 10.3.149.59:0/35501 187818 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.640915) v2 ==== 47+0+0 (3027252596 0 0) 0x9d0a200 con 0x175172e0
-16> 2016-10-24 18:52:05.641209 7f3087a46700 1 -- 10.3.149.62:6810/25857 <== osd.29 10.3.149.59:0/35501 187818 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.640915) v2 ==== 47+0+0 (3027252596 0 0) 0xa27ba00 con 0x264422c0
-15> 2016-10-24 18:52:05.641246 7f3086243700 1 -- 10.3.149.57:6811/25857 --> 10.3.149.59:0/35501 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.640915) v2 -- ?+0 0x1b8a6200 con 0x175172e0
-14> 2016-10-24 18:52:05.641290 7f3087a46700 1 -- 10.3.149.62:6810/25857 --> 10.3.149.59:0/35501 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.640915) v2 -- ?+0 0x1ff4f600 con 0x264422c0
-13> 2016-10-24 18:52:05.689610 7f3086243700 1 -- 10.3.149.57:6811/25857 <== osd.13 10.3.149.56:0/5402 187624 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.634215) v2 ==== 47+0+0 (1310408758 0 0) 0x1be24600 con 0x15268b00
-12> 2016-10-24 18:52:05.689664 7f3086243700 1 -- 10.3.149.57:6811/25857 --> 10.3.149.56:0/5402 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.634215) v2 -- ?+0 0x9d0a200 con 0x15268b00
-11> 2016-10-24 18:52:05.689661 7f3087a46700 1 -- 10.3.149.62:6810/25857 <== osd.13 10.3.149.56:0/5402 187624 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.634215) v2 ==== 47+0+0 (1310408758 0 0) 0x19705600 con 0x175b1de0
-10> 2016-10-24 18:52:05.689729 7f3087a46700 1 -- 10.3.149.62:6810/25857 --> 10.3.149.56:0/5402 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.634215) v2 -- ?+0 0xa27ba00 con 0x175b1de0
-9> 2016-10-24 18:52:05.861925 7f3086243700 1 -- 10.3.149.57:6811/25857 <== osd.4 10.3.149.60:0/12742 187653 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.801655) v2 ==== 47+0+0 (350590821 0 0) 0x12169400 con 0x17514000
-8> 2016-10-24 18:52:05.861957 7f3086243700 1 -- 10.3.149.57:6811/25857 --> 10.3.149.60:0/12742 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.801655) v2 -- ?+0 0x1be24600 con 0x17514000
-7> 2016-10-24 18:52:05.861963 7f3087a46700 1 -- 10.3.149.62:6810/25857 <== osd.4 10.3.149.60:0/12742 187653 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.801655) v2 ==== 47+0+0 (350590821 0 0) 0x269fba00 con 0x26442840
-6> 2016-10-24 18:52:05.862015 7f3087a46700 1 -- 10.3.149.62:6810/25857 --> 10.3.149.60:0/12742 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.801655) v2 -- ?+0 0x19705600 con 0x26442840
-5> 2016-10-24 18:52:05.882605 7f3094bb6700 5 osd.19 3014 tick
-4> 2016-10-24 18:52:05.988572 7f3086243700 1 -- 10.3.149.57:6811/25857 <== osd.25 10.3.149.58:0/24382 187898 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.984426) v2 ==== 47+0+0 (3778423740 0 0) 0xae91200 con 0x177bb760
-3> 2016-10-24 18:52:05.988582 7f3087a46700 1 -- 10.3.149.62:6810/25857 <== osd.25 10.3.149.58:0/24382 187898 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.984426) v2 ==== 47+0+0 (3778423740 0 0) 0x1a396000 con 0x1526bc80
-2> 2016-10-24 18:52:05.988608 7f3086243700 1 -- 10.3.149.57:6811/25857 --> 10.3.149.58:0/24382 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.984426) v2 -- ?+0 0x12169400 con 0x177bb760
-1> 2016-10-24 18:52:05.988652 7f3087a46700 1 -- 10.3.149.62:6810/25857 --> 10.3.149.58:0/24382 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.984426) v2 -- ?+0 0x269fba00 con 0x1526bc80
0> 2016-10-24 18:52:06.216341 7f307c22f700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7f307c22f700 time 2016-10-24 18:52:06.213123
os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)

ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc9195]
2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xc94) [0x909f34]
3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x311) [0x9fe0e1]
4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2e8) [0x8ce8c8]
5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x213) [0x7def53]
6: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4c2) [0x7df722]
7: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xbe) [0x6dcade]
8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbb9966]
9: (ThreadPool::WorkThread::entry()+0x10) [0xbba9f0]
10: (()+0x7dc5) [0x7f309cd26dc5]
11: (clone()+0x6d) [0x7f309b80821d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.19.log
--- end dump of recent events ---

Since the ceph-osd objdump output is too large to put in a mail, I will not attach it, but if it is needed I'll find a way to share it.

What might be the cause? Can anyone help me with this?

Thanks.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
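As a footnote to the thread, the failed assert quoted above can be read as a small truth table. The sketch below mirrors the expression `allow_eio || !m_filestore_fail_eio || got != -5` in Python, on the understanding that `-5` is the negated Linux errno `EIO` (which fits the disk EIO diagnosis in the reply). The function name is hypothetical and this is not Ceph code.

```python
import errno


def filestore_read_assert_holds(allow_eio, m_filestore_fail_eio, got):
    """Mirror of: assert(allow_eio || !m_filestore_fail_eio || got != -5).

    `got` stands for the read result: a byte count on success or a
    negated errno value on failure. Returns False when the assert would fire.
    """
    return allow_eio or (not m_filestore_fail_eio) or got != -errno.EIO


# The combination that would trigger the crash in the log:
# EIO is not tolerated by the caller, fail-on-EIO is enabled, and the read returned -EIO.
crash_case = filestore_read_assert_holds(False, True, -errno.EIO)

# A successful read (positive byte count) never trips the assert.
ok_case = filestore_read_assert_holds(False, True, 4096)
```

Read this way, the OSD aborts on purpose when the underlying read reports an I/O error, so a repeated crash on a single OSD is consistent with one failing disk.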