So it seems that more than one PG has problems and that something abnormal
occurred in the cluster. Assuming your underlying storage, filesystems and
networking work as expected, you should check the timestamps/md5sums/attrs
of the PGs' objects across the cluster. If you conclude that the copies of
the PGs on the crashing OSDs are the most recent, you can export them and
import them into a temporary OSD. If the PGs contain no data on the crashing
OSDs, you also have the option to remove them there and let the cluster
continue. You should probably also send a message to ceph-devel in case you
have hit a bug.

On 19 September 2016 at 00:00, Ronny Aasen <ronny+ceph-users@xxxxxxxx> wrote:
> added debug journal = 20 and got some new lines in the log. that i added to
> the end of this email.
>
> any of you can make something out of them ?
>
> kind regards
> Ronny Aasen
>
>
> On 18.09.2016 18:59, Kostis Fardelas wrote:
>>
>> If you are aware of the problematic PGs and they are exportable, then
>> ceph-objectstore-tool is a viable solution. If not, then running gdb
>> and/or higher debug osd level logs may prove useful (to understand
>> more about the problem or collect info to ask for more in ceph-devel).
>>
>> On 13 September 2016 at 17:26, Henrik Korkuc <lists@xxxxxxxxx> wrote:
>>>
>>> On 16-09-13 11:13, Ronny Aasen wrote:
>>>>
>>>> I suspect this must be a difficult question since there have been no
>>>> replies on irc or mailinglist.
>>>>
>>>> assuming it's impossible to get these osd's running again.
>>>>
>>>> Is there a way to recover objects from the disks. ? they are mounted and
>>>> data is readable. I have pg's down since they want to probe these osd's that
>>>> do not want to start.
>>>>
>>>> pg query claim it can continue if i mark the osd as lost. but i would
>>>> prefer to not loose data. especially since the data is ok and readable on
>>>> the nonfunctioning osd.
>>>>
>>>> also let me know if there is other debug i can extract in order to
>>>> troubleshoot the non starting osd's
>>>>
>>>> kind regards
>>>> Ronny Aasen
>>>>
>>>>
>>> I cannot help you with this, but you can try using
>>> http://ceph.com/community/incomplete-pgs-oh-my/ and
>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000238.html
>>> (found this mail thread googling for the objectool post). ymmv
>>>
>>>>
>>>> On 12. sep. 2016 13:16, Ronny Aasen wrote:
>>>>>
>>>>> after adding more osd's and having a big backfill running 2 of my osd's
>>>>> keep on stopping.
>>>>>
>>>>> We also recently upgraded from 0.94.7 to 0.94.9 but i do not know if
>>>>> that is related.
>>>>>
>>>>> the log say.
>
> [snip old error log. ]
>
> -17> 2016-09-18 22:52:06.405881 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head/1/578c53b6/rb.0.392c.238e1f29.0000000513d5/head '_' = 266
> -16> 2016-09-18 22:52:06.405915 7f878791b880 15 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head/1/578c53b6/rb.0.392c.238e1f29.0000000513d5/21 '_'
> -15> 2016-09-18 22:52:06.406049 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head/1/578c53b6/rb.0.392c.238e1f29.0000000513d5/21 '_' = 251
> -14> 2016-09-18 22:52:06.406079 7f878791b880 15 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head/1/4ecf13b6/rb.0.392c.238e1f29.00000037c4cb/21 '_'
> -13> 2016-09-18 22:52:06.406166 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) error opening file /var/lib/ceph/osd/ceph-106/current/1.3b6_head/DIR_6/DIR_B/DIR_3/DIR_1/DIR_F/rb.0.392c.238e1f29.00000037c4cb__21_4ECF13B6__1 with flags=2: (2) No such file or directory
> -12> 2016-09-18 22:52:06.406187 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head/1/4ecf13b6/rb.0.392c.238e1f29.00000037c4cb/21 '_' = -2
> -11> 2016-09-18 22:52:06.406190 7f878791b880 15 read_log missing 104661'46956,1/4ecf13b6/rb.0.392c.238e1f29.00000037c4cb/21
> -10> 2016-09-18 22:52:06.406195 7f878791b880 15 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head/1/e85f13b6/rb.0.392c.238e1f29.000000b5bb3b/head '_'
> -9> 2016-09-18 22:52:06.406279 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) error opening file /var/lib/ceph/osd/ceph-106/current/1.3b6_head/DIR_6/DIR_B/DIR_3/DIR_1/DIR_F/rb.0.392c.238e1f29.000000b5bb3b__head_E85F13B6__1 with flags=2: (2) No such file or directory
> -8> 2016-09-18 22:52:06.406293 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head/1/e85f13b6/rb.0.392c.238e1f29.000000b5bb3b/head '_' = -2
> -7> 2016-09-18 22:52:06.406297 7f878791b880 15 read_log missing 104661'46955,1/e85f13b6/rb.0.392c.238e1f29.000000b5bb3b/head
> -6> 2016-09-18 22:52:06.406311 7f878791b880 15 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head/1/e85f13b6/rb.0.392c.238e1f29.000000b5bb3b/21 '_'
> -5> 2016-09-18 22:52:06.406363 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) error opening file /var/lib/ceph/osd/ceph-106/current/1.3b6_head/DIR_6/DIR_B/DIR_3/DIR_1/DIR_F/rb.0.392c.238e1f29.000000b5bb3b__21_E85F13B6__1 with flags=2: (2) No such file or directory
> -4> 2016-09-18 22:52:06.406369 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head/1/e85f13b6/rb.0.392c.238e1f29.000000b5bb3b/21 '_' = -2
> -3> 2016-09-18 22:52:06.406372 7f878791b880 15 read_log missing 91332'39092,1/e85f13b6/rb.0.392c.238e1f29.000000b5bb3b/21
> -2> 2016-09-18 22:52:06.406375 7f878791b880 15 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head/1/d9c303b6/rb.0.392c.238e1f29.000000004943/head '_'
> -1> 2016-09-18 22:52:06.426875 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head/1/d9c303b6/rb.0.392c.238e1f29.000000004943/head '_' = 266
> 0> 2016-09-18 22:52:06.455911 7f878791b880 -1 osd/PGLog.cc: In function 'static void PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, const pg_info_t&, std::map<eversion_t, hobject_t>&, PGLog::IndexedLog&, pg_missing_t&, std::ostringstream&, std::set<std::basic_string<char> >*)' thread 7f878791b880 time 2016-09-18 22:52:06.426909
> osd/PGLog.cc: 984: FAILED assert(oi.version == i->first)
>
> ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x76) [0xc0f196]
> 2: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, std::map<eversion_t, hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*)+0x11ab) [0x76f9ab]
> 3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x1e2) [0x7f6c72]
> 4: (OSD::load_pgs()+0xac0) [0x6abd00]
> 5: (OSD::init()+0x14da) [0x6af54a]
> 6: (main()+0x2848) [0x6339f8]
> 7: (__libc_start_main()+0xf5) [0x7f8784c3ab45]
> 8: /usr/bin/ceph-osd() [0x64d687]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- logging levels ---
> 0/ 5 none
> 0/ 1 lockdep
> 0/ 1 context
> 1/ 1 crush
> 1/ 5 mds
> 1/ 5 mds_balancer
> 1/ 5 mds_locker
> 1/ 5 mds_log
> 1/ 5 mds_log_expire
> 1/ 5 mds_migrator
> 0/ 1 buffer
> 0/ 1 timer
> 0/ 1 filer
> 0/ 1 striper
> 0/ 1 objecter
> 0/ 5 rados
> 0/ 5 rbd
> 0/ 5 rbd_replay
> 0/ 5 journaler
> 0/ 5 objectcacher
> 0/ 5 client
> 20/20 osd
> 0/ 5 optracker
> 0/ 5 objclass
> 20/20 filestore
> 1/ 3 keyvaluestore
> 20/20 journal
> 0/ 5 ms
> 1/ 5 mon
> 0/10 monc
> 1/ 5 paxos
> 0/ 5 tp
> 1/ 5 auth
> 1/ 5 crypto
> 1/ 1 finisher
> 1/ 5 heartbeatmap
> 1/ 5 perfcounter
> 1/ 5 rgw
> 1/10 civetweb
> 1/ 5 javaclient
> 1/ 5 asok
> 1/ 1 throttle
> 0/ 0 refs
> 1/ 5 xio
> -2/-2 (syslog threshold)
> -1/-1 (stderr threshold)
> max_recent 10000
> max_new 1000
> log_file /var/log/ceph/ceph-osd.106.log
> --- end dump of recent events ---
> 2016-09-18 22:52:06.621664 7f878791b880 -1 *** Caught signal (Aborted) **
> in thread 7f878791b880
>
> ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
> 1: /usr/bin/ceph-osd() [0xb0c4d3]
> 2: (()+0xf8d0) [0x7f87867ad8d0]
> 3: (gsignal()+0x37) [0x7f8784c4e067]
> 4: (abort()+0x148) [0x7f8784c4f448]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7f878553bb3d]
> 6: (()+0x5ebb6) [0x7f8785539bb6]
> 7: (()+0x5ec01) [0x7f8785539c01]
> 8: (()+0x5ee19) [0x7f8785539e19]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x247) [0xc0f367]
> 10: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, std::map<eversion_t, hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*)+0x11ab) [0x76f9ab]
> 11: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x1e2) [0x7f6c72]
> 12: (OSD::load_pgs()+0xac0) [0x6abd00]
> 13: (OSD::init()+0x14da) [0x6af54a]
> 14: (main()+0x2848) [0x6339f8]
> 15: (__libc_start_main()+0xf5) [0x7f8784c3ab45]
> 16: /usr/bin/ceph-osd() [0x64d687]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- begin dump of recent events ---
> 0> 2016-09-18 22:52:06.621664 7f878791b880 -1 *** Caught signal (Aborted) **
> in thread 7f878791b880
>
> ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
> 1: /usr/bin/ceph-osd() [0xb0c4d3]
> 2: (()+0xf8d0) [0x7f87867ad8d0]
> 3: (gsignal()+0x37) [0x7f8784c4e067]
> 4: (abort()+0x148) [0x7f8784c4f448]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7f878553bb3d]
> 6: (()+0x5ebb6) [0x7f8785539bb6]
> 7: (()+0x5ec01) [0x7f8785539c01]
> 8: (()+0x5ee19) [0x7f8785539e19]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x247) [0xc0f367]
> 10: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, std::map<eversion_t, hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*)+0x11ab) [0x76f9ab]
> 11: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x1e2) [0x7f6c72]
> 12: (OSD::load_pgs()+0xac0) [0x6abd00]
> 13: (OSD::init()+0x14da) [0x6af54a]
> 14: (main()+0x2848) [0x6339f8]
> 15: (__libc_start_main()+0xf5) [0x7f8784c3ab45]
> 16: /usr/bin/ceph-osd() [0x64d687]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- logging levels ---
> 0/ 5 none
> 0/ 1 lockdep
> 0/ 1 context
> 1/ 1 crush
> 1/ 5 mds
> 1/ 5 mds_balancer
> 1/ 5 mds_locker
> 1/ 5 mds_log
> 1/ 5 mds_log_expire
> 1/ 5 mds_migrator
> 0/ 1 buffer
> 0/ 1 timer
> 0/ 1 filer
> 0/ 1 striper
> 0/ 1 objecter
> 0/ 5 rados
> 0/ 5 rbd
> 0/ 5 rbd_replay
> 0/ 5 journaler
> 0/ 5 objectcacher
> 0/ 5 client
> 20/20 osd
> 0/ 5 optracker
> 0/ 5 objclass
> 20/20 filestore
> 1/ 3 keyvaluestore
> 20/20 journal
> 0/ 5 ms
> 1/ 5 mon
> 0/10 monc
> 1/ 5 paxos
> 0/ 5 tp
> 1/ 5 auth
> 1/ 5 crypto
> 1/ 1 finisher
> 1/ 5 heartbeatmap
> 1/ 5 perfcounter
> 1/ 5 rgw
> 1/10 civetweb
> 1/ 5 javaclient
> 1/ 5 asok
> 1/ 1 throttle
> 0/ 0 refs
> 1/ 5 xio
> -2/-2 (syslog threshold)
> -1/-1 (stderr threshold)
> max_recent 10000
> max_new 1000
> log_file /var/log/ceph/ceph-osd.106.log
> --- end dump of recent events ---
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
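
For the timestamp/md5sum check suggested in the reply at the top of this message, something along these lines may help. It is only a rough sketch, assuming FileStore OSDs whose PG directories look like /var/lib/ceph/osd/ceph-106/current/1.3b6_head (the path seen in the log above) and that the OSD is stopped while the files are read. The function names fingerprint_pg and diff_fingerprints are made up for illustration and are not part of any Ceph tool. Files are keyed by basename on the assumption that the DIR_* nesting can differ between OSDs; xattrs are not compared here.

```python
import hashlib
import json
import os
import sys


def fingerprint_pg(pg_dir):
    """Map each object file name under pg_dir to [md5 hex digest, size, mtime].

    Keyed by basename (an assumption): the DIR_x/DIR_y nesting may differ
    between OSDs, while the on-disk object file name should not.
    """
    result = {}
    for root, _dirs, files in os.walk(pg_dir):
        for name in files:
            path = os.path.join(root, name)
            digest = hashlib.md5()
            with open(path, "rb") as fh:
                # Read in 1 MiB chunks so large objects do not fill memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            st = os.stat(path)
            result[name] = [digest.hexdigest(), st.st_size, int(st.st_mtime)]
    return result


def diff_fingerprints(a, b):
    """Return (only_in_a, only_in_b, differing_md5) as sorted lists of names."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    differing = sorted(k for k in set(a) & set(b) if a[k][0] != b[k][0])
    return only_a, only_b, differing


if __name__ == "__main__" and len(sys.argv) == 2:
    # Dump one JSON document per PG directory; collect the dumps from each
    # OSD host and compare them with diff_fingerprints() or a plain diff.
    json.dump(fingerprint_pg(sys.argv[1]), sys.stdout, indent=1, sort_keys=True)
```

Run it once per OSD that holds a copy of the PG, passing the PG directory as the only argument, and compare the resulting JSON files. Names that appear on only one OSD, or whose md5 differs, are the objects worth a closer look before deciding which copy to export.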