Re: Help! OSDs across the cluster just crashed

Hi,

Sorry to hear that. I’ve been battling with mine for 2 weeks :/

I’ve repaired my OSDs with the following commands. My OSD logs (/var/log/ceph/ceph-osd.<id>.log) have a line containing log(ERR) with the PG number next to it, just before the crash dump.

# $1 = OSD id, $2 = PG id of the affected placement group
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op trim-pg-log --pgid $2
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op fix-lost --pgid $2
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op repair --pgid $2
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op mark-complete --pgid $2
systemctl restart ceph-osd@$1

I don’t know whether it will work for you, but it shouldn’t hurt to try it on a single OSD.

There is very little information about these tools, so it might be risky. I hope someone more experienced can help further.
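For what it’s worth, here is the same sequence wrapped in a rough shell sketch (untested and purely illustrative; the script name and the grep pattern are my own assumptions, and ceph-objectstore-tool needs the OSD daemon stopped while it runs against the store):

#!/bin/bash
# repair-pg.sh <osd-id> <pg-id>  -- illustrative only, run at your own risk
set -e
OSD=$1
PG=$2

# The PG id shows up in the OSD log next to the log(ERR) line, right before the crash dump, e.g.:
#   grep 'log(ERR)' /var/log/ceph/ceph-osd.${OSD}.log

systemctl stop ceph-osd@${OSD}   # the tool needs exclusive access to the OSD's data store

for OP in trim-pg-log fix-lost repair mark-complete; do
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${OSD}/ --op ${OP} --pgid ${PG}
done

systemctl start ceph-osd@${OSD}

Again, I would try it on one OSD first and watch the log before touching the rest.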


> On 2 Oct 2018, at 23:23, Brett Chancellor <bchancellor@xxxxxxxxxxxxxx> wrote:
> 
> Help. I have a 60-node cluster and most of the OSDs decided to crash themselves at the same time. They won't restart; the messages look like...
> 
> --- begin dump of recent events ---
>      0> 2018-10-02 21:19:16.990369 7f57ab5b7d80 -1 *** Caught signal (Aborted) **
>  in thread 7f57ab5b7d80 thread_name:ceph-osd
> 
>  ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
>  1: (()+0xa3c611) [0x556d618bb611]
>  2: (()+0xf6d0) [0x7f57a885e6d0]
>  3: (gsignal()+0x37) [0x7f57a787f277]
>  4: (abort()+0x148) [0x7f57a7880968]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x284) [0x556d618fa6e4]
>  6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t const&)+0x3b2) [0x556d615c74a2]
>  7: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380) [0x556d615ae6c0]
>  8: (OSD::build_past_intervals_parallel()+0x9ff) [0x556d613707af]
>  9: (OSD::load_pgs()+0x545) [0x556d61373095]
>  10: (OSD::init()+0x2169) [0x556d613919d9]
>  11: (main()+0x2d07) [0x556d61295dd7]
>  12: (__libc_start_main()+0xf5) [0x7f57a786b445]
>  13: (()+0x4b53e3) [0x556d613343e3]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 
> 
> Some hosts have no working OSDs; others seem to have 1 working and 2 dead. It's spread all across the cluster, across several different racks. Any idea where to look next? The cluster is dead in the water right now.
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



