Hammer OSD crash during deep scrub

I've had a few OSD crashes from time to time; the latest looked like this:

--- begin dump of recent events ---
   -12> 2016-02-15 01:28:15.386412 7f29c8828700  1 -- 10.0.3.2:6819/448052
 <== osd.17 10.0.3.1:0/6746 181211 ====
 osd_ping(ping e12542 stamp 2016-02-15 01:28:15.385759)
 v2 ==== 47+0+0 (1302847072 0 0) 0x215d8200 con 0x1bda4dc0
   -11> 2016-02-15 01:28:15.386449 7f29c8828700  1 -- 10.0.3.2:6819/448052
 --> 10.0.3.1:0/6746 -- osd_ping(ping_reply e12542 stamp
 2016-02-15 01:28:15.385759) v2 -- ?+0 0x1b805a00 con 0x1bda4dc0
   -10> 2016-02-15 01:28:15.387151 7f29ca62b700  1 -- 10.0.3.2:6820/448052
 <== osd.17 10.0.3.1:0/6746 181211 ==== osd_ping(ping e12542 stamp
 2016-02-15 01:28:15.385759) v2 ==== 47+0+0 (1302847072 0 0)
 0x21a69e00 con 0x1bd59600
    -9> 2016-02-15 01:28:15.387187 7f29ca62b700  1 -- 10.0.3.2:6820/448052
 --> 10.0.3.1:0/6746 -- osd_ping(ping_reply e12542 stamp
 2016-02-15 01:28:15.385759) v2 -- ?+0 0x1b99ba00 con 0x1bd59600
    -8> 2016-02-15 01:28:15.513752 7f29c8828700  1 -- 10.0.3.2:6819/448052
 <== osd.2 10.0.3.3:0/5787 180736 ==== osd_ping(ping e12542 stamp
 2016-02-15 01:28:15.510966) v2 ==== 47+0+0 (1623718975 0 0)
 0x7febc00 con 0x1bddc840
    -7> 2016-02-15 01:28:15.513785 7f29c8828700  1 -- 10.0.3.2:6819/448052
 --> 10.0.3.3:0/5787 -- osd_ping(ping_reply e12542 stamp
 2016-02-15 01:28:15.510966) v2 -- ?+0 0x215d8200 con 0x1bddc840
    -6> 2016-02-15 01:28:15.513943 7f29ca62b700  1 -- 10.0.3.2:6820/448052
 <== osd.2 10.0.3.3:0/5787 180736 ==== osd_ping(ping e12542 stamp
 2016-02-15 01:28:15.510966) v2 ==== 47+0+0 (1623718975 0 0)
 0x1ef38600 con 0x1bde0b00
    -5> 2016-02-15 01:28:15.514001 7f29ca62b700  1 -- 10.0.3.2:6820/448052
 --> 10.0.3.3:0/5787 -- osd_ping(ping_reply e12542 stamp
 2016-02-15 01:28:15.510966) v2 -- ?+0 0x21a69e00 con 0x1bde0b00
    -4> 2016-02-15 01:28:15.629642 7f29c8828700  1 -- 10.0.3.2:6819/448052
 <== osd.7 10.0.3.1:0/5838 180780 ==== osd_ping(ping e12542 stamp
 2016-02-15 01:28:15.628456) v2 ==== 47+0+0 (241913765 0 0)
 0x1c944c00 con 0x1b8b4160
    -3> 2016-02-15 01:28:15.629689 7f29c8828700  1 -- 10.0.3.2:6819/448052
 --> 10.0.3.1:0/5838 -- osd_ping(ping_reply e12542 stamp
 2016-02-15 01:28:15.628456) v2 -- ?+0 0x7febc00 con 0x1b8b4160
    -2> 2016-02-15 01:28:15.629667 7f29ca62b700  1 -- 10.0.3.2:6820/448052
 <== osd.7 10.0.3.1:0/5838 180780 ==== osd_ping(ping e12542 stamp
 2016-02-15 01:28:15.628456) v2 ==== 47+0+0 (241913765 0 0)
 0x1d516200 con 0x1b7ae000
    -1> 2016-02-15 01:28:15.629728 7f29ca62b700  1 -- 10.0.3.2:6820/448052
 --> 10.0.3.1:0/5838 -- osd_ping(ping_reply e12542 stamp
 2016-02-15 01:28:15.628456) v2 -- ?+0 0x1ef38600 con 0x1b7ae000
     0> 2016-02-15 01:28:15.644402 7f29b840e700 -1
 *** Caught signal (Aborted) **
 in thread 7f29b840e700

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: /usr/bin/ceph-osd() [0xbf03dc]
 2: (()+0xf0a0) [0x7f29e4c4d0a0]
 3: (gsignal()+0x35) [0x7f29e35b7165]
 4: (abort()+0x180) [0x7f29e35ba3e0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f29e3e0d89d]
 6: (()+0x63996) [0x7f29e3e0b996]
 7: (()+0x639c3) [0x7f29e3e0b9c3]
 8: (()+0x63bee) [0x7f29e3e0bbee]
 9: (ceph::__ceph_assert_fail(char const*,
 char const*, int, char const*)+0x220) [0xcddda0]
 10: (FileStore::read(coll_t, ghobject_t const&, unsigned long,
 unsigned long, ceph::buffer::list&, unsigned int, bool)+0x8cb) [0xa296cb]
 11: (ReplicatedBackend::be_deep_scrub(hobject_t const&,
 unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x287) [0xb1a527]
 12: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t,
 std::allocator<hobject_t> > const&, bool, unsigned int,
 ThreadPool::TPHandle&)+0x52c) [0x9f8ddc]
 13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t,
 bool, unsigned int, ThreadPool::TPHandle&)+0x124) [0x910ee4]
 14: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x481)
 [0x9116d1]
 15: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xf4)
 [0x8119f4]
 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xccfd69]
 17: (ThreadPool::WorkThread::entry()+0x10) [0xcd0f70]
 18: (()+0x6b50) [0x7f29e4c44b50]
 19: (clone()+0x6d) [0x7f29e366095d]

Looks like it may have been an I/O error during the read, although nothing was
logged in syslog at the time. However, the drive behind this OSD currently shows
a predictive failure status in the RAID controller, so maybe that's it...
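
For what it's worth, deep scrub is the operation that actually reads every
object's data back from disk and checksums it (a regular scrub only compares
metadata), which is why a latent bad sector can stay hidden until the next deep
scrub touches it. Below is a rough conceptual sketch of that read-and-verify
loop in Python, just to illustrate the failure mode; it is not Ceph's actual
code, and the object paths and chunk size are made up:

import hashlib

def deep_scrub_object(path, chunk_size=4 * 1024 * 1024):
    # Read the whole object back from disk and digest it, the way a deep
    # scrub has to; a bad sector shows up here as a failed read (OSError/EIO).
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

def scrub_objects(object_paths):
    # Checksum every object in the chunk being scrubbed.
    digests = {}
    for path in object_paths:
        try:
            digests[path] = deep_scrub_object(path)
        except OSError as e:
            # Illustration only: the sketch can report the error and move on.
            print(f"read error on {path}: {e}")
    return digests

In the real OSD the failed read isn't handled that gracefully: FileStore::read
hits an assert and the daemon aborts, which matches the backtrace above.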

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com