Re: [Help: pool not responding] Now osd crash

Mario Giammarco <mgiammarco@xxxxxxxxx> · Wed, 9 Mar 2016 00:11:27 +0100

Hello, I have probably restarted the OSD too many times, or marked OSDs in/out too many times, but now I get this:

root@proxmox-zotac:~# /usr/bin/ceph-osd -i 1 --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf --cluster ceph -f   
starting osd.1 at :/0 osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal

osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*, ceph::bufferlist*)' thread 7f7fd358e880 time 2016-03-09 00:08:09.193975
osd/PG.cc: 2868: FAILED assert(r > 0)
 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x76) [0xc03c46]
 2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x4ab) [0x7c616b]
 3: (OSD::load_pgs()+0xa20) [0x6a9170]
 4: (OSD::init()+0xc84) [0x6ac204]
 5: (main()+0x2839) [0x632459]
 6: (__libc_start_main()+0xf5) [0x7f7fd08b3b45]
 7: /usr/bin/ceph-osd() [0x64c087]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
 in thread 7f7fd358e880
 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: /usr/bin/ceph-osd() [0xb04503]
 2: (()+0xf8d0) [0x7f7fd24268d0]
 3: (gsignal()+0x37) [0x7f7fd08c7067]
 4: (abort()+0x148) [0x7f7fd08c8448]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7f7fd11b4b3d]
 6: (()+0x5ebb6) [0x7f7fd11b2bb6]
 7: (()+0x5ec01) [0x7f7fd11b2c01]
 8: (()+0x5ee19) [0x7f7fd11b2e19]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x247) [0xc03e17]
 10: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x4ab) [0x7c616b]
 11: (OSD::load_pgs()+0xa20) [0x6a9170]
 12: (OSD::init()+0xc84) [0x6ac204]
 13: (main()+0x2839) [0x632459]
 14: (__libc_start_main()+0xf5) [0x7f7fd08b3b45]
 15: /usr/bin/ceph-osd() [0x64c087]

2016-03-02 9:38 GMT+01:00 Mario Giammarco <mgiammarco@xxxxxxxxx>:
Here it is:

 cluster ac7bc476-3a02-453d-8e5c-606ab6f022ca
     health HEALTH_WARN
            4 pgs incomplete
            4 pgs stuck inactive
            4 pgs stuck unclean
            1 requests are blocked > 32 sec
     monmap e8: 3 mons at {0=10.1.0.12:6789/0,1=10.1.0.14:6789/0,2=10.1.0.17:6789/0}
            election epoch 840, quorum 0,1,2 0,1,2
     osdmap e2405: 3 osds: 3 up, 3 in
      pgmap v5904430: 288 pgs, 4 pools, 391 GB data, 100 kobjects
            1090 GB used, 4481 GB / 5571 GB avail
                 284 active+clean
                   4 incomplete
  client io 4008 B/s rd, 446 kB/s wr, 23 op/s
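
For anyone who wants to track these numbers over time, the per-state PG counts in the status above can be pulled out of the text with a few lines of Python. This is only a rough sketch: the function name pg_state_counts is made up for illustration, and the parsing assumes the plain-text "ceph -s" layout pasted in this thread (other releases may print it differently).

```python
import re


def pg_state_counts(status_text):
    """Return (total_pgs, {state: count}) parsed from plain-text `ceph -s` output.

    Assumes the layout shown in this thread: a `pgmap vN: X pgs, ...` line
    followed by indented `<count> <state>` lines.
    """
    total = None
    states = {}
    in_pgmap = False
    for line in status_text.splitlines():
        stripped = line.strip()
        header = re.match(r"pgmap v\d+: (\d+) pgs", stripped)
        if header:
            total = int(header.group(1))
            in_pgmap = True
            continue
        if in_pgmap:
            # State lines look like "284 active+clean"; the usage line
            # ("1090 GB used, ...") and "client io ..." do not match.
            state = re.fullmatch(r"(\d+) ([a-z_+]+)", stripped)
            if state:
                states[state.group(2)] = int(state.group(1))
    return total, states


SAMPLE = """\
      pgmap v5904430: 288 pgs, 4 pools, 391 GB data, 100 kobjects
            1090 GB used, 4481 GB / 5571 GB avail
                 284 active+clean
                   4 incomplete
  client io 4008 B/s rd, 446 kB/s wr, 23 op/s
"""

total, states = pg_state_counts(SAMPLE)
not_clean = {s: n for s, n in states.items() if s != "active+clean"}
print(total, states, not_clean)
```

The per-state counts should add up to the pgmap total (here 284 + 4 = 288), which is a quick sanity check that nothing was missed.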

2016-03-02 9:31 GMT+01:00 Shinobu Kinjo <skinjo@xxxxxxxxxx>:
Is "ceph -s" still showing you the same output?

>     cluster ac7bc476-3a02-453d-8e5c-606ab6f022ca
>      health HEALTH_WARN
>             4 pgs incomplete
>             4 pgs stuck inactive
>             4 pgs stuck unclean
>      monmap e8: 3 mons at {0=10.1.0.12:6789/0,1=10.1.0.14:6789/0,2=10.1.0.17:6789/0}
>             election epoch 832, quorum 0,1,2 0,1,2
>      osdmap e2400: 3 osds: 3 up, 3 in
>       pgmap v5883297: 288 pgs, 4 pools, 391 GB data, 100 kobjects
>             1090 GB used, 4481 GB / 5571 GB avail
>                  284 active+clean
>                    4 incomplete

Cheers,

S

----- Original Message -----
From: "Mario Giammarco" <mgiammarco@xxxxxxxxx>
To: "Lionel Bouton" <lionel-subscription@xxxxxxxxxxx>
Cc: "Shinobu Kinjo" <skinjo@xxxxxxxxxx>, ceph-users@xxxxxxxxxxxxxx
Sent: Wednesday, March 2, 2016 4:27:15 PM
Subject: Re: Help: pool not responding

I tried setting min_size=1, but unfortunately nothing has changed.
Thanks for the idea.

2016-02-29 22:56 GMT+01:00 Lionel Bouton <lionel-subscription@xxxxxxxxxxx>:

> On 29/02/2016 22:50, Shinobu Kinjo wrote:
>
> the fact that they are optimized for benchmarks and certainly not
> Ceph OSD usage patterns (with or without internal journal).
>
> Are you assuming that SSHD is causing the issue?
> If you could elaborate on this more, it would be helpful.
>
> Probably not (unless they turn out to be extremely unreliable with Ceph
> OSD usage patterns, which would be surprising to me).
>
> For incomplete PGs the documentation seems good enough for what should be
> done:
> http://docs.ceph.com/docs/master/rados/operations/pg-states/
>
> The relevant text:
>
> *Incomplete* Ceph detects that a placement group is missing information
> about writes that may have occurred, or does not have any healthy copies.
> If you see this state, try to start any failed OSDs that may contain the
> needed information or temporarily adjust min_size to allow recovery.
>
> We don't have the full history, but the most probable cause of these
> incomplete PGs is that min_size is set to 2 or 3 and at some time the 4
> incomplete PGs didn't have as many replicas as the min_size value. So if
> setting min_size to 2 isn't enough, setting it to 1 should unfreeze them.
>
> Lionel
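
The min_size suggestion quoted above could be scripted roughly as follows. This is a sketch only: it builds the ceph CLI command lines without running them, the helper name min_size_commands is made up for illustration, and the pool name "rbd" is an assumption because the thread never names the affected pool. As reported earlier in this thread, lowering min_size to 1 did not unfreeze the PGs in this particular case.

```python
import shlex


def min_size_commands(pool, temporary_min_size=1):
    """Build (but do not run) ceph CLI calls for a temporary min_size change.

    `pool` is whichever pool owns the incomplete PGs; the thread does not
    name it, so any value passed here is the caller's assumption.
    """
    if temporary_min_size < 1:
        raise ValueError("min_size must be at least 1")
    return [
        # See which PGs are incomplete and which OSDs they map to.
        ["ceph", "health", "detail"],
        # Record the current value so it can be restored after recovery.
        ["ceph", "osd", "pool", "get", pool, "min_size"],
        # Temporarily lower min_size to allow recovery.
        ["ceph", "osd", "pool", "set", pool, "min_size", str(temporary_min_size)],
    ]


# "rbd" is a placeholder pool name, not taken from the thread.
commands = min_size_commands("rbd")
for argv in commands:
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Once the PGs have recovered, min_size should be set back to the value returned by the "get" call, since leaving it at 1 allows writes with only a single surviving copy.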

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com