osd dies with m_filestore_fail_eio without dmesg error

Hello

I have an OSD that regularly dies on I/O, especially during scrubbing.
Normally I would assume a bad disk and replace it, but then I normally see messages in dmesg about the device and its errors. For this OSD
there are no errors in dmesg at all after a crash like this.

This OSD is a 5-disk software RAID 5 array. It has had broken disks in the past that were replaced and the parity recalculated. It runs XFS with a journal on an SSD partition.
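Since dmesg stays silent, in case it helps whoever looks at this: I can brute-force read everything under the OSD data directory and see which paths come back with EIO. This is a rough sketch of my own (not a Ceph tool); the default path is assumed from the df output further down, so adjust as needed.

```python
#!/usr/bin/env python3
# Rough helper sketch (not a ceph tool): read every file under the OSD data
# dir and report the paths where the kernel returns EIO -- the same errno
# the FileStore assert in the log below is tripping on.
import errno
import os
import sys

def find_eio(root):
    """Walk `root`, read each file front to back, and collect the paths
    that fail with EIO (errno 5)."""
    bad = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while f.read(4 << 20):  # read in 4 MiB chunks
                        pass
            except OSError as e:
                if e.errno == errno.EIO:
                    bad.append(path)
    return bad

if __name__ == "__main__":
    # Path assumed from the df output below; pass another dir as argv[1].
    root = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/ceph/osd/ceph-109/current"
    for path in find_eio(root):
        print(path)
```

Any path it prints is a file the kernel cannot read back, even though nothing shows up in dmesg.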


I can start the OSD again and it works for a while (several days) before it crashes again. Could one of you look at the log for this OSD and see if there is any way to salvage it?

And is there any information I should gather before I scratch the filesystem and recreate it? Perhaps there is some valuable insight into what's going on?
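For reference (and for the archives): the `got != -5` in the assert below is just -EIO coming back from the read, which is why I would have expected a matching dmesg entry. A quick way to confirm what errno 5 means:

```python
# Tiny check: the -5 in the FileStore assert is -EIO, i.e. the kernel
# reported an I/O error on the read itself.
import errno
import os

assert errno.EIO == 5
print(os.strerror(errno.EIO))  # -> Input/output error
```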

kind regards
Ronny Aasen



Running Debian Jessie + Hammer 0.94.7.

# uname -a
Linux ceph-osd2 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2+deb8u3 (2016-07-02) x86_64 GNU/Linux

# ceph --version
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)


/dev/md123p1     11T  8.5T  2.5T  78% /var/lib/ceph/osd/ceph-109


md123 : active raid5 sdw[0] sdaa[5] sdz[3] sdy[2] sdx[1]
      11720540160 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      bitmap: 0/22 pages [0KB], 65536KB chunk




-- logdump osd.106 --
-8> 2016-09-05 12:09:28.145307 7eff0dbb9700 1 -- 10.24.12.22:6806/7970 <== osd.57 10.24.11.23:0/3554 39827 ==== osd_ping(ping e106009 stamp 2016-09-05 12:09:28.144970) v2 ==== 47+0+0 (2908886623 0 0) 0x6a634800 con 0x6326e420
-7> 2016-09-05 12:09:28.145336 7eff0dbb9700 1 -- 10.24.12.22:6806/7970 --> 10.24.11.23:0/3554 -- osd_ping(ping_reply e106009 stamp 2016-09-05 12:09:28.144970) v2 -- ?+0 0x6874e200 con 0x6326e420
-6> 2016-09-05 12:09:28.145365 7eff0f3bc700 1 -- 10.24.11.22:6808/7970 <== osd.57 10.24.11.23:0/3554 39827 ==== osd_ping(ping e106009 stamp 2016-09-05 12:09:28.144970) v2 ==== 47+0+0 (2908886623 0 0) 0x65214000 con 0x6388b5a0
-5> 2016-09-05 12:09:28.145392 7eff0f3bc700 1 -- 10.24.11.22:6808/7970 --> 10.24.11.23:0/3554 -- osd_ping(ping_reply e106009 stamp 2016-09-05 12:09:28.144970) v2 -- ?+0 0x69095e00 con 0x6388b5a0
-4> 2016-09-05 12:09:28.185545 7eff0f3bc700 1 -- 10.24.11.22:6808/7970 <== osd.23 10.24.12.25:0/2640 39867 ==== osd_ping(ping e106009 stamp 2016-09-05 12:09:28.184760) v2 ==== 47+0+0 (2624121755 0 0) 0x690a3a00 con 0x63888c60
-3> 2016-09-05 12:09:28.185582 7eff0f3bc700 1 -- 10.24.11.22:6808/7970 --> 10.24.12.25:0/2640 -- osd_ping(ping_reply e106009 stamp 2016-09-05 12:09:28.184760) v2 -- ?+0 0x65214000 con 0x63888c60
-2> 2016-09-05 12:09:28.185956 7eff0dbb9700 1 -- 10.24.12.22:6806/7970 <== osd.23 10.24.12.25:0/2640 39867 ==== osd_ping(ping e106009 stamp 2016-09-05 12:09:28.184760) v2 ==== 47+0+0 (2624121755 0 0) 0x6510a200 con 0x63888160
-1> 2016-09-05 12:09:28.185977 7eff0dbb9700 1 -- 10.24.12.22:6806/7970 --> 10.24.12.25:0/2640 -- osd_ping(ping_reply e106009 stamp 2016-09-05 12:09:28.184760) v2 -- ?+0 0x6a634800 con 0x63888160
 0> 2016-09-05 12:09:28.186884 7eff03ba5700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7eff03ba5700 time 2016-09-05 12:09:27.988279
os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)

 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x76) [0xc0e746]
 2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xcc2) [0x9108a2]
 3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x31c) [0xa21cfc]
 4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2ca) [0x8d2c9a]
 5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x1fa) [0x7dfc2a]
 6: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4ae) [0x7e03de]
 7: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xb3) [0x6ce513]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa77) [0xbfecf7]
 9: (ThreadPool::WorkThread::entry()+0x10) [0xbffdc0]
 10: (()+0x80a4) [0x7eff360800a4]
 11: (clone()+0x6d) [0x7eff345db87d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.106.log
--- end dump of recent events ---
2016-09-05 12:09:28.308118 7eff03ba5700 -1 *** Caught signal (Aborted) **
 in thread 7eff03ba5700

 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: /usr/bin/ceph-osd() [0xb0bae3]
 2: (()+0xf8d0) [0x7eff360878d0]
 3: (gsignal()+0x37) [0x7eff34528067]
 4: (abort()+0x148) [0x7eff34529448]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7eff34e15b3d]
 6: (()+0x5ebb6) [0x7eff34e13bb6]
 7: (()+0x5ec01) [0x7eff34e13c01]
 8: (()+0x5ee19) [0x7eff34e13e19]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x247) [0xc0e917]
 10: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xcc2) [0x9108a2]
 11: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x31c) [0xa21cfc]
 12: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2ca) [0x8d2c9a]
 13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x1fa) [0x7dfc2a]
 14: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4ae) [0x7e03de]
 15: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xb3) [0x6ce513]
 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa77) [0xbfecf7]
 17: (ThreadPool::WorkThread::entry()+0x10) [0xbffdc0]
 18: (()+0x80a4) [0x7eff360800a4]
 19: (clone()+0x6d) [0x7eff345db87d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
-4> 2016-09-05 12:09:28.263346 7eff0f3bc700 1 -- 10.24.11.22:6808/7970 <== osd.28 10.24.11.25:0/2428 39715 ==== osd_ping(ping e106009 stamp 2016-09-05 12:09:28.262364) v2 ==== 47+0+0 (3122935865 0 0) 0x68964400 con 0x62949b80
-3> 2016-09-05 12:09:28.263364 7eff0dbb9700 1 -- 10.24.12.22:6806/7970 <== osd.28 10.24.11.25:0/2428 39715 ==== osd_ping(ping e106009 stamp 2016-09-05 12:09:28.262364) v2 ==== 47+0+0 (3122935865 0 0) 0x67dd9400 con 0x63270c00
-2> 2016-09-05 12:09:28.263381 7eff0f3bc700 1 -- 10.24.11.22:6808/7970 --> 10.24.11.25:0/2428 -- osd_ping(ping_reply e106009 stamp 2016-09-05 12:09:28.262364) v2 -- ?+0 0x690a3a00 con 0x62949b80
-1> 2016-09-05 12:09:28.263443 7eff0dbb9700 1 -- 10.24.12.22:6806/7970 --> 10.24.11.25:0/2428 -- osd_ping(ping_reply e106009 stamp 2016-09-05 12:09:28.262364) v2 -- ?+0 0x6510a200 con 0x63270c00
 0> 2016-09-05 12:09:28.308118 7eff03ba5700 -1 *** Caught signal (Aborted) **
 in thread 7eff03ba5700

 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: /usr/bin/ceph-osd() [0xb0bae3]
 2: (()+0xf8d0) [0x7eff360878d0]
 3: (gsignal()+0x37) [0x7eff34528067]
 4: (abort()+0x148) [0x7eff34529448]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7eff34e15b3d]
 6: (()+0x5ebb6) [0x7eff34e13bb6]
 7: (()+0x5ec01) [0x7eff34e13c01]
 8: (()+0x5ee19) [0x7eff34e13e19]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x247) [0xc0e917]
 10: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xcc2) [0x9108a2]
 11: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x31c) [0xa21cfc]
 12: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2ca) [0x8d2c9a]
 13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x1fa) [0x7dfc2a]
 14: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4ae) [0x7e03de]
 15: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xb3) [0x6ce513]
 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa77) [0xbfecf7]
 17: (ThreadPool::WorkThread::entry()+0x10) [0xbffdc0]
 18: (()+0x80a4) [0x7eff360800a4]
 19: (clone()+0x6d) [0x7eff345db87d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.106.log
--- end dump of recent events ---
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
