Re: Hammer OSD crash during deep scrub

Hello!

On Wed, Feb 17, 2016 at 07:38:15AM +0000, ceph.user wrote:

>  ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
>  1: /usr/bin/ceph-osd() [0xbf03dc]
>  2: (()+0xf0a0) [0x7f29e4c4d0a0]
>  3: (gsignal()+0x35) [0x7f29e35b7165]
>  4: (abort()+0x180) [0x7f29e35ba3e0]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f29e3e0d89d]
>  6: (()+0x63996) [0x7f29e3e0b996]
>  7: (()+0x639c3) [0x7f29e3e0b9c3]
>  8: (()+0x63bee) [0x7f29e3e0bbee]
>  9: (ceph::__ceph_assert_fail(char const*,
>  char const*, int, char const*)+0x220) [0xcddda0]
>  10: (FileStore::read(coll_t, ghobject_t const&, unsigned long,
>  unsigned long, ceph::buffer::list&, unsigned int, bool)+0x8cb) [0xa296cb]
>  11: (ReplicatedBackend::be_deep_scrub(hobject_t const&,
>  unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x287) [0xb1a527]
>  12: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t,
>  std::allocator<hobject_t> > const&, bool, unsigned int,
>  ThreadPool::TPHandle&)+0x52c) [0x9f8ddc]
>  13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t,
>  bool, unsigned int, ThreadPool::TPHandle&)+0x124) [0x910ee4]
>  14: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x481)
>  [0x9116d1]
>  15: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xf4)
>  [0x8119f4]
>  16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xccfd69]
>  17: (ThreadPool::WorkThread::entry()+0x10) [0xcd0f70]
>  18: (()+0x6b50) [0x7f29e4c44b50]
>  19: (clone()+0x6d) [0x7f29e366095d]

> Looks like an I/O error during a read, maybe,
> though nothing was logged in syslog at the time.
> But the drive currently shows predictive-failure status
> in the RAID controller, so maybe...

I have an issue like yours:

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: (()+0x6149ea) [0x55944d5669ea]
 2: (()+0x10340) [0x7f6ff4271340]
 3: (gsignal()+0x39) [0x7f6ff2710cc9]
 4: (abort()+0x148) [0x7f6ff27140d8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f6ff301b535]
 6: (()+0x5e6d6) [0x7f6ff30196d6]
 7: (()+0x5e703) [0x7f6ff3019703]
 8: (()+0x5e922) [0x7f6ff3019922]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x55944d65f368]
 10: (SnapSet::get_clone_bytes(snapid_t) const+0xb6) [0x55944d2d0306]
 11: (ReplicatedPG::_scrub(ScrubMap&, std::map<hobject_t, std::pair<unsigned int, unsigned int>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, std::pair<unsigned int, unsigned int> > > > const&)+0xa1c) [0x55944d3af3cc]
 12: (PG::scrub_compare_maps()+0xec9) [0x55944d31ed19]
 13: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1ee) [0x55944d321dce]
 14: (PG::scrub(ThreadPool::TPHandle&)+0x2ee) [0x55944d32374e]
 15: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x55944d207fa9]
 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0x55944d64fd66]
 17: (ThreadPool::WorkThread::entry()+0x10) [0x55944d650e10]
 18: (()+0x8182) [0x7f6ff4269182]
 19: (clone()+0x6d) [0x7f6ff27d447d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

This appears when scrubbing, deep scrubbing, or repairing PG 5.ca; I can reproduce it every time.
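For anyone hitting a similar per-PG crash, it can help to enumerate the objects (including clones) stored in the affected PG with ceph-objectstore-tool while the OSD is stopped, to find the object whose snapset metadata is bad. A sketch only; the OSD id (12) and the default Hammer data/journal paths are assumptions, adjust to your deployment:

```shell
# The tool needs exclusive access, so stop the OSD first.
sudo systemctl stop ceph-osd@12    # or: sudo stop ceph-osd id=12 on Upstart systems

# List every object and clone stored in the crashing PG on this OSD.
sudo ceph-objectstore-tool \
    --data-path /var/lib/ceph/osd/ceph-12 \
    --journal-path /var/lib/ceph/osd/ceph-12/journal \
    --op list --pgid 5.ca
```

Once a suspicious head object is identified from the listing, its attributes (including the snapset) can be inspected with the same tool's get-attr operation.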

I tried removing and re-creating the OSD, but it did not help.

Now I'm going to check the OSD filesystem, though there is nothing strange in syslog and no SMART errors reported for this drive.
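Even with a clean syslog, a manual pass over the disk and filesystem is cheap to do. A sketch, assuming an XFS-backed OSD on /dev/sdb1 mounted at the default path (device, partition, and OSD id are assumptions):

```shell
# Full SMART attribute dump; watch reallocated and pending sector counts.
sudo smartctl -a /dev/sdb

# Kick off an extended self-test in the background; re-run `smartctl -a`
# later to see the result in the self-test log.
sudo smartctl -t long /dev/sdb

# Check the OSD filesystem in no-modify mode (reports problems only).
# The OSD must be stopped and the filesystem unmounted first.
sudo umount /var/lib/ceph/osd/ceph-12
sudo xfs_repair -n /dev/sdb1
```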

-- 
WBR, Max A. Krasilnikov
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


