Re: Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

Brett Chancellor <bchancellor@xxxxxxxxxxxxxx> · Mon, 19 Aug 2019 01:57:58 -0400

This sounds familiar. Do any of the pools on the SSDs have a high object-to-PG ratio, say more than 500k objects per PG? ("ceph pg ls" reports the object count for each PG.)
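
For anyone who wants to check this across many PGs, here is a rough Python sketch that filters the JSON form of the PG listing for PGs above the 500k mark. It assumes each entry carries "pgid" and "stat_sum.num_objects", and that the output is either a bare list or a list wrapped in a "pg_stats" key depending on the release; verify both against your own output. The sample data is made up.

```python
import json


def dense_pgs(pg_ls_json, threshold=500_000):
    """Return (pgid, num_objects) for PGs holding more than `threshold` objects."""
    data = json.loads(pg_ls_json)
    # Assumption: some releases emit a bare list of PG stats, others wrap the
    # list in a "pg_stats" key; accept either shape.
    stats = data.get("pg_stats", []) if isinstance(data, dict) else data
    dense = [
        (pg["pgid"], pg["stat_sum"]["num_objects"])
        for pg in stats
        if pg["stat_sum"]["num_objects"] > threshold
    ]
    # Densest PGs first.
    return sorted(dense, key=lambda item: item[1], reverse=True)


# Hypothetical sample, not taken from the cluster discussed in this thread.
sample = json.dumps([
    {"pgid": "7.1a", "stat_sum": {"num_objects": 812345}},
    {"pgid": "7.1b", "stat_sum": {"num_objects": 1200}},
])
print(dense_pgs(sample))
```

To feed it real data, save the output of "ceph pg ls --format json" to a file and pass the file contents to dense_pgs().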

On Sun, Aug 18, 2019, 10:12 PM Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
On Thu, Aug 15, 2019 at 2:09 AM Troy Ablan <tablan@xxxxxxxxx> wrote:

>
> Paul,
>
> Thanks for the reply.  All of these seemed to fail except for pulling
> the osdmap from the live cluster.
>
> -Troy
>
> -[~:#]- ceph-objectstore-tool --op get-osdmap --data-path /var/lib/ceph/osd/ceph-45/ --file osdmap45
> terminate called after throwing an instance of 'ceph::buffer::malformed_input'
>    what():  buffer::malformed_input: unsupported bucket algorithm: -1
> *** Caught signal (Aborted) **
>   in thread 7f945ee04f00 thread_name:ceph-objectstor
>   ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
>   1: (()+0xf5d0) [0x7f94531935d0]
>   2: (gsignal()+0x37) [0x7f9451d80207]
>   3: (abort()+0x148) [0x7f9451d818f8]
>   4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f945268f7d5]
>   5: (()+0x5e746) [0x7f945268d746]
>   6: (()+0x5e773) [0x7f945268d773]
>   7: (__cxa_rethrow()+0x49) [0x7f945268d9e9]
>   8: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x18b8) [0x7f94553218d8]
>   9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x4ad) [0x7f94550ff4ad]
>   10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f9455101db1]
>   11: (get_osdmap(ObjectStore*, unsigned int, OSDMap&, ceph::buffer::list&)+0x1d0) [0x55de1f9a6e60]
>   12: (main()+0x5340) [0x55de1f8c8870]
>   13: (__libc_start_main()+0xf5) [0x7f9451d6c3d5]
>   14: (()+0x3adc10) [0x55de1f9a1c10]
> Aborted
>
> -[~:#]- ceph-objectstore-tool --op get-osdmap --data-path /var/lib/ceph/osd/ceph-46/ --file osdmap46
> terminate called after throwing an instance of 'ceph::buffer::malformed_input'
>    what():  buffer::malformed_input: unsupported bucket algorithm: -1
> *** Caught signal (Aborted) **
>   in thread 7f9ce4135f00 thread_name:ceph-objectstor
>   ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
>   1: (()+0xf5d0) [0x7f9cd84c45d0]
>   2: (gsignal()+0x37) [0x7f9cd70b1207]
>   3: (abort()+0x148) [0x7f9cd70b28f8]
>   4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f9cd79c07d5]
>   5: (()+0x5e746) [0x7f9cd79be746]
>   6: (()+0x5e773) [0x7f9cd79be773]
>   7: (__cxa_rethrow()+0x49) [0x7f9cd79be9e9]
>   8: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x18b8) [0x7f9cda6528d8]
>   9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x4ad) [0x7f9cda4304ad]
>   10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f9cda432db1]
>   11: (get_osdmap(ObjectStore*, unsigned int, OSDMap&, ceph::buffer::list&)+0x1d0) [0x55cea26c8e60]
>   12: (main()+0x5340) [0x55cea25ea870]
>   13: (__libc_start_main()+0xf5) [0x7f9cd709d3d5]
>   14: (()+0x3adc10) [0x55cea26c3c10]
> Aborted
>
> -[~:#]- ceph osd getmap -o osdmap
> got osdmap epoch 81298
>
> -[~:#]- ceph-objectstore-tool --op set-osdmap --data-path /var/lib/ceph/osd/ceph-46/ --file osdmap
> osdmap (#-1:92f679f2:::osdmap.81298:0#) does not exist.
>
> -[~:#]- ceph-objectstore-tool --op set-osdmap --data-path /var/lib/ceph/osd/ceph-45/ --file osdmap
> osdmap (#-1:92f679f2:::osdmap.81298:0#) does not exist.

  auto ch = store->open_collection(coll_t::meta());
  const ghobject_t full_oid = OSD::get_osdmap_pobject_name(e);
  if (!store->exists(ch, full_oid)) {
    cerr << "osdmap (" << full_oid << ") does not exist." << std::endl;
    if (!force) {
      return -ENOENT;
    }
    cout << "Creating a new epoch." << std::endl;
  }

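Adding "--force" should get you past that error. As the code above shows, without it set-osdmap returns -ENOENT when the epoch is not already in the OSD's store; with it, the tool goes on to create the new epoch.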
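
To make that concrete, here is a small Python sketch that builds the retry command for each dead OSD. The helper name and the position of "--force" at the end of the command are my assumptions; the other arguments are copied from the commands quoted in this thread. It only prints the command and does not run anything.

```python
def set_osdmap_cmd(osd_id, map_file, force=False):
    """Build the ceph-objectstore-tool argv for injecting an osdmap into one OSD.

    Hypothetical helper for illustration; it only assembles the argument list.
    """
    cmd = [
        "ceph-objectstore-tool",
        "--op", "set-osdmap",
        "--data-path", f"/var/lib/ceph/osd/ceph-{osd_id}/",
        "--file", map_file,
    ]
    if force:
        # Assumption: option order does not matter, so --force goes last.
        cmd.append("--force")
    return cmd


# OSD ids and the map file name are the ones used earlier in the thread.
for osd_id in (45, 46):
    print(" ".join(set_osdmap_cmd(osd_id, "osdmap", force=True)))
```

The list form can be handed to subprocess.run() as-is once you are happy with what it prints.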

>
> On 8/14/19 2:54 AM, Paul Emmerich wrote:
> > Starting point to debug/fix this would be to extract the osdmap from
> > one of the dead OSDs:
> >
> > ceph-objectstore-tool --op get-osdmap --data-path /var/lib/ceph/osd/...
> >
> > Then try to run osdmaptool on that osdmap to see if it also crashes,
> > set some --debug options (don't know which one off the top of my
> > head).
> > Does it also crash? How does it differ from the map retrieved with
> > "ceph osd getmap"?
> >
> > You can also set the osdmap with "--op set-osdmap", does it help to
> > set the osdmap retrieved by "ceph osd getmap"?
> >
> > Paul
> >
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Cheers,
Brad

