Re: Help! OSDs across the cluster just crashed

Yeah, don't run these commands blind. They change the PG's local metadata in ways that can make it inconsistent with the rest of the cluster and result in lost data.

Brett, it seems this issue has come up several times in the field but we haven't been able to reproduce it locally or get enough info to debug what's going on: https://tracker.ceph.com/issues/21142
Maybe run through that ticket and see if you can contribute new logs or add detail about possible sources?
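If anyone who hits this can capture a verbose log of the failing startup, that would give the ticket a lot more to go on. Roughly something like this (only a sketch; the OSD id is a placeholder and the debug levels are just the usual ones):

# make sure systemd isn't also trying to restart it, then run the OSD in the foreground with more logging
systemctl stop ceph-osd@12
ceph-osd -f --cluster ceph --id 12 --setuser ceph --setgroup ceph --debug_osd 20 --debug_ms 1 2>&1 | tee /tmp/osd.12-startup.log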
-Greg

On Tue, Oct 2, 2018 at 3:18 PM Goktug Yildirim <goktug.yildirim@xxxxxxxxx> wrote:
Hi,

Sorry to hear that. I’ve been battling with mine for 2 weeks :/

I've corrected my OSDs with the commands below. My OSD logs (/var/log/ceph/ceph-OSDx.log) have a log [ERR] line with the PG number next to it, just before the crash dump.
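To pull that out, something like this should do (just a sketch; osd.12 is a placeholder and the ERR pattern may need adjusting to your exact log format):

# show the last ERR lines in the OSD log; the PG id (e.g. 1.2ab) appears on those lines
grep -n 'ERR' /var/log/ceph/ceph-osd.12.log | tail -n 20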

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op trim-pg-log --pgid $2
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op fix-lost --pgid $2
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op repair --pgid $2
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op mark-complete --pgid $2
systemctl restart ceph-osd@$1
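Here $1 is the OSD id and $2 the PG id. Keep in mind that ceph-objectstore-tool needs the OSD stopped before it can open the store. As a rough wrapper it would look something like this (only a sketch; the script name and the explicit stop are my own framing):

#!/bin/bash
# fix-pg.sh <osd-id> <pg-id>  (hypothetical wrapper around the commands above)
set -e
OSD="$1"; PG="$2"
systemctl stop ceph-osd@"$OSD"   # the tool needs exclusive access to the store
for OP in trim-pg-log fix-lost repair mark-complete; do
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-"$OSD"/ --op "$OP" --pgid "$PG"
done
systemctl restart ceph-osd@"$OSD"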

I don't know if it will work for you, but there may be no harm in trying it on a single OSD.

There is very little information about these tools, so it might be risky. I hope someone more experienced can help further.


> On 2 Oct 2018, at 23:23, Brett Chancellor <bchancellor@xxxxxxxxxxxxxx> wrote:
>
> Help. I have a 60-node cluster and most of the OSDs decided to crash themselves at the same time. They won't restart; the messages look like...
>
> --- begin dump of recent events ---
>      0> 2018-10-02 21:19:16.990369 7f57ab5b7d80 -1 *** Caught signal (Aborted) **
>  in thread 7f57ab5b7d80 thread_name:ceph-osd
>
>  ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
>  1: (()+0xa3c611) [0x556d618bb611]
>  2: (()+0xf6d0) [0x7f57a885e6d0]
>  3: (gsignal()+0x37) [0x7f57a787f277]
>  4: (abort()+0x148) [0x7f57a7880968]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x556d618fa6e4]
>  6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x3b2) [0x556d615c74a2]
>  7: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) [0x556d615ae6c0]
>  8: (OSD::build_past_intervals_parallel()+0x9ff) [0x556d613707af]
>  9: (OSD::load_pgs()+0x545) [0x556d61373095]
>  10: (OSD::init()+0x2169) [0x556d613919d9]
>  11: (main()+0x2d07) [0x556d61295dd7]
>  12: (__libc_start_main()+0xf5) [0x7f57a786b445]
>  13: (()+0x4b53e3) [0x556d613343e3]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>
> Some hosts have no working OSDs, others seem to have one working and two dead. It's spread all across the cluster, across several different racks. Any idea where to look next? The cluster is dead in the water right now.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
