So, how can I temporarily disable that OSD without blocking the whole cluster? A rebalance will take hours, and everything will be unavailable during that time. Since the data is replicated, can I instruct clients to use only the replicas? And did that I/O error come from the journal (SSD) or from the storage (HDD)?

Thanks,
Olivier B.
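P.S.: To make the question concrete, here is the kind of procedure I am hoping exists — only a rough sketch, with /dev/sdb as a hypothetical stand-in for whatever disk actually backs osd.12 (the OSD from the logs below):

    # Ask the monitors not to mark any OSD "out" while it is down,
    # so that stopping the daemon does not trigger a rebalance:
    ceph osd set noout

    # Stop the failing OSD daemon (sysvinit syntax, as used with 0.56;
    # adjust to your init system):
    /etc/init.d/ceph stop osd.12
    # My hope: while osd.12 is down, clients keep working off the
    # remaining replicas automatically, with no client-side change.

    # Check whether the EIO came from the data disk or the journal SSD
    # (/dev/sdb is a placeholder for the disk behind osd.12):
    dmesg | grep -i 'i/o error'   # kernel read errors name the device
    smartctl -a /dev/sdb          # SMART health of the suspect disk

    # Once the disk is replaced (or proves healthy):
    /etc/init.d/ceph start osd.12
    ceph osd unset noout

(About the journal-vs-storage part of the question: the assert in the logs below fires in FileStore::read, which reads from the object store rather than from the journal, so I suspect the EIO comes from the HDD.)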
On Sunday, 10 February 2013 at 20:52 +0100, Olivier Bonvalet wrote:
> Wow... all that verbose stuff is «just» a read error?
>
> But OK, I will continue on the ceph-users list then ;)
>
>
> On Sunday, 10 February 2013 at 10:10 -0800, Gregory Farnum wrote:
> > The OSD daemon is getting back EIO when it tries to do a read. Sounds like your disk is going bad.
> > -Greg
> >
> > PS: This question is a good fit for the new ceph-users list. :)
> >
> >
> > On Sunday, February 10, 2013 at 9:45 AM, Olivier Bonvalet wrote:
> > >
> > > Hi,
> > >
> > > I have an OSD that keeps crashing (ceph 0.56.2), with this in the logs:
> > >
> > > 446 stamp 2013-02-10 18:37:27.559777) v2 ==== 47+0+0 (4068038983 0 0) 0x11e028c0 con 0x573d6e0
> > > -3> 2013-02-10 18:37:27.561618 7f1c765d5700 1 -- 192.168.42.1:0/5824 <== osd.31 192.168.42.3:6811/23050 129 ==== osd_ping(ping_reply e13446 stamp 2013-02-10 18:37:27.559777) v2 ==== 47+0+0 (4068038983 0 0) 0x73be380 con 0x573d420
> > > -2> 2013-02-10 18:37:27.562674 7f1c765d5700 1 -- 192.168.42.1:0/5824 <== osd.1 192.168.42.2:6803/7458 129 ==== osd_ping(ping_reply e13446 stamp 2013-02-10 18:37:27.559777) v2 ==== 47+0+0 (4068038983 0 0) 0x6bd8a80 con 0x573dc60
> > > -1> 2013-02-10 18:37:28.217626 7f1c805e9700 5 osd.12 13444 tick
> > > 0> 2013-02-10 18:37:28.552692 7f1c725cd700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const hobject_t&, uint64_t, size_t, ceph::bufferlist&)' thread 7f1c725cd700 time 2013-02-10 18:37:28.537715
> > > os/FileStore.cc: 2732: FAILED assert(!m_filestore_fail_eio || got != -5)
> > >
> > > ceph version ()
> > > 1: (FileStore::read(coll_t, hobject_t const&, unsigned long, unsigned long, ceph::buffer::list&)+0x462) [0x725f92]
> > > 2: (PG::_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> >&, bool)+0x371) [0x685da1]
> > > 3: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool)+0x29b) [0x6866bb]
> > > 4: (PG::replica_scrub(MOSDRepScrub*)+0x8e9) [0x6952b9]
> > > 5: (OSD::RepScrubWQ::_process(MOSDRepScrub*)+0xc2) [0x6410a2]
> > > 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x80f9e9]
> > > 7: (ThreadPool::WorkThread::entry()+0x10) [0x8121f0]
> > > 8: (()+0x68ca) [0x7f1c852f48ca]
> > > 9: (clone()+0x6d) [0x7f1c83e23b6d]
> > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > >
> > > --- logging levels ---
> > > 0/ 5 none
> > > 0/ 1 lockdep
> > > 0/ 1 context
> > > 1/ 1 crush
> > > 1/ 5 mds
> > > 1/ 5 mds_balancer
> > > 1/ 5 mds_locker
> > > 1/ 5 mds_log
> > > 1/ 5 mds_log_expire
> > > 1/ 5 mds_migrator
> > > 0/ 1 buffer
> > > 0/ 1 timer
> > > 0/ 1 filer
> > > 0/ 1 striper
> > > 0/ 1 objecter
> > > 0/ 5 rados
> > > 0/ 5 rbd
> > > 0/ 5 journaler
> > > 0/ 5 objectcacher
> > > 0/ 5 client
> > > 0/ 5 osd
> > > 0/ 5 optracker
> > > 0/ 5 objclass
> > > 1/ 3 filestore
> > > 1/ 3 journal
> > > 0/ 5 ms
> > > 1/ 5 mon
> > > 0/10 monc
> > > 0/ 5 paxos
> > > 0/ 5 tp
> > > 1/ 5 auth
> > > 1/ 5 crypto
> > > 1/ 1 finisher
> > > 1/ 5 heartbeatmap
> > > 1/ 5 perfcounter
> > > 1/ 5 rgw
> > > 1/ 5 hadoop
> > > 1/ 5 javaclient
> > > 1/ 5 asok
> > > 1/ 1 throttle
> > > -2/-2 (syslog threshold)
> > > -1/-1 (stderr threshold)
> > > max_recent 100000
> > > max_new 1000
> > > log_file /var/log/ceph/osd.12.log
> > > --- end dump of recent events ---
> > > 2013-02-10 18:37:29.236649 7f1c725cd700 -1 *** Caught signal (Aborted) **
> > > in thread 7f1c725cd700
> > >
> > > ceph version ()
> > > 1: /usr/bin/ceph-osd() [0x7a0db9]
> > > 2: (()+0xeff0) [0x7f1c852fcff0]
> > > 3: (gsignal()+0x35) [0x7f1c83d861b5]
> > > 4: (abort()+0x180) [0x7f1c83d88fc0]
> > > 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f1c8461adc5]
> > > 6: (()+0xcb166) [0x7f1c84619166]
> > > 7: (()+0xcb193) [0x7f1c84619193]
> > > 8: (()+0xcb28e) [0x7f1c8461928e]
> > > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f3fc9]
> > > 10: (FileStore::read(coll_t, hobject_t const&, unsigned long, unsigned long, ceph::buffer::list&)+0x462) [0x725f92]
> > > 11: (PG::_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> >&, bool)+0x371) [0x685da1]
> > > 12: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool)+0x29b) [0x6866bb]
> > > 13: (PG::replica_scrub(MOSDRepScrub*)+0x8e9) [0x6952b9]
> > > 14: (OSD::RepScrubWQ::_process(MOSDRepScrub*)+0xc2) [0x6410a2]
> > > 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x80f9e9]
> > > 16: (ThreadPool::WorkThread::entry()+0x10) [0x8121f0]
> > > 17: (()+0x68ca) [0x7f1c852f48ca]
> > > 18: (clone()+0x6d) [0x7f1c83e23b6d]
> > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > >
> > > --- begin dump of recent events ---
> > > -1> 2013-02-10 18:37:29.217778 7f1c805e9700 5 osd.12 13444 tick
> > > 0> 2013-02-10 18:37:29.236649 7f1c725cd700 -1 *** Caught signal (Aborted) **
> > > in thread 7f1c725cd700
> > >
> > > ceph version ()
> > > 1: /usr/bin/ceph-osd() [0x7a0db9]
> > > 2: (()+0xeff0) [0x7f1c852fcff0]
> > > 3: (gsignal()+0x35) [0x7f1c83d861b5]
> > > 4: (abort()+0x180) [0x7f1c83d88fc0]
> > > 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f1c8461adc5]
> > > 6: (()+0xcb166) [0x7f1c84619166]
> > > 7: (()+0xcb193) [0x7f1c84619193]
> > > 8: (()+0xcb28e) [0x7f1c8461928e]
> > > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f3fc9]
> > > 10: (FileStore::read(coll_t, hobject_t const&, unsigned long, unsigned long, ceph::buffer::list&)+0x462) [0x725f92]
> > > 11: (PG::_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> >&, bool)+0x371) [0x685da1]
> > > 12: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool)+0x29b) [0x6866bb]
> > > 13: (PG::replica_scrub(MOSDRepScrub*)+0x8e9) [0x6952b9]
> > > 14: (OSD::RepScrubWQ::_process(MOSDRepScrub*)+0xc2) [0x6410a2]
> > > 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x80f9e9]
> > > 16: (ThreadPool::WorkThread::entry()+0x10) [0x8121f0]
> > > 17: (()+0x68ca) [0x7f1c852f48ca]
> > > 18: (clone()+0x6d) [0x7f1c83e23b6d]
> > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > >
> > > --- logging levels ---
> > > 0/ 5 none
> > > 0/ 1 lockdep
> > > 0/ 1 context
> > > 1/ 1 crush
> > > 1/ 5 mds
> > > 1/ 5 mds_balancer
> > > 1/ 5 mds_locker
> > > 1/ 5 mds_log
> > > 1/ 5 mds_log_expire
> > > 1/ 5 mds_migrator
> > > 0/ 1 buffer
> > > 0/ 1 timer
> > > 0/ 1 filer
> > > 0/ 1 striper
> > > 0/ 1 objecter
> > > 0/ 5 rados
> > > 0/ 5 rbd
> > > 0/ 5 journaler
> > > 0/ 5 objectcacher
> > > 0/ 5 client
> > > 0/ 5 osd
> > > 0/ 5 optracker
> > > 0/ 5 objclass
> > > 1/ 3 filestore
> > > 1/ 3 journal
> > > 0/ 5 ms
> > > 1/ 5 mon
> > > 0/10 monc
> > > 0/ 5 paxos
> > > 0/ 5 tp
> > > 1/ 5 auth
> > > 1/ 5 crypto
> > > 1/ 1 finisher
> > > 1/ 5 heartbeatmap
> > > 1/ 5 perfcounter
> > > 1/ 5 rgw
> > > 1/ 5 hadoop
> > > 1/ 5 javaclient
> > > 1/ 5 asok
> > > 1/ 1 throttle
> > > -2/-2 (syslog threshold)
> > > -1/-1 (stderr threshold)
> > > max_recent 100000
> > > max_new 1000
> > > log_file /var/log/ceph/osd.12.log
> > > --- end dump of recent events ---
> > >
> > >
> > > What should I do?