Well, there were a few bugs logged around upgrades which hit a similar assert, but those
were supposedly fixed 2 years ago. It looks like Ubuntu 15.04 shipped Hammer (0.94.5), so
presumably that's what you upgraded from. The current Jewel release is 10.2.10 - I don't
know whether the problem you're seeing is fixed there, but I'd upgrade to 10.2.10 and then
open a tracker ticket if the problem still persists (a rough sketch of the usual
package-upgrade steps is at the bottom of this mail, below the quoted logs).

On Thu, Oct 26, 2017 at 9:13 AM, Gonzalo Aguilar Delgado
<gaguilar@xxxxxxxxxxxxxxxxxx> wrote:
> Hello,
>
> I cannot tell what the previous version was, since I used the one installed
> on Ubuntu 15.04. Now 16.04.
>
> But what I can tell is that I get errors from the ceph osd and mon daemons from
> time to time. The mon problems are scary, since I have to wipe the monitor and
> then reinstall a new one. I cannot really understand what's going on. I have
> never had so many problems as after updating.
>
> Should I open a bug report?
>
> ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55d5d510b250]
>  2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x642) [0x55d5d4ade2b2]
>  3: (OSD::load_pgs()+0x75a) [0x55d5d4a3383a]
>  4: (OSD::init()+0x2026) [0x55d5d4a3ec46]
>  5: (main()+0x2d6b) [0x55d5d49b193b]
>  6: (__libc_start_main()+0xf0) [0x7f49d02e5830]
>  7: (_start()+0x29) [0x55d5d49f28c9]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 rbd_mirror
>    0/ 5 rbd_replay
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>    0/ 5 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/10 civetweb
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>    0/ 0 refs
>    1/ 5 xio
>    1/ 5 compressor
>    1/ 5 newstore
>    1/ 5 bluestore
>    1/ 5 bluefs
>    1/ 3 bdev
>    1/ 5 kstore
>    4/ 5 rocksdb
>    4/ 5 leveldb
>    1/ 5 kinetic
>    1/ 5 fuse
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent 10000
>   max_new 1000
>   log_file /var/log/ceph/ceph-osd.3.log
> --- end dump of recent events ---
> 2017-10-25 22:09:58.778107 7f49d36958c0 -1 *** Caught signal (Aborted) **
>  in thread 7f49d36958c0 thread_name:ceph-osd
>
>  ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>  1: (()+0x9616ee) [0x55d5d500b6ee]
>  2: (()+0x11390) [0x7f49d235e390]
>  3: (gsignal()+0x38) [0x7f49d02fa428]
>  4: (abort()+0x16a) [0x7f49d02fc02a]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x26b) [0x55d5d510b43b]
>  6: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x642) [0x55d5d4ade2b2]
>  7: (OSD::load_pgs()+0x75a) [0x55d5d4a3383a]
>  8: (OSD::init()+0x2026) [0x55d5d4a3ec46]
>  9: (main()+0x2d6b) [0x55d5d49b193b]
>  10: (__libc_start_main()+0xf0) [0x7f49d02e5830]
>  11: (_start()+0x29) [0x55d5d49f28c9]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- begin dump of recent events ---
>      0> 2017-10-25 22:09:58.778107 7f49d36958c0 -1 *** Caught signal (Aborted) **
>  in thread 7f49d36958c0 thread_name:ceph-osd
>
>  ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>  1: (()+0x9616ee) [0x55d5d500b6ee]
>  2: (()+0x11390) [0x7f49d235e390]
>  3: (gsignal()+0x38) [0x7f49d02fa428]
>  4: (abort()+0x16a) [0x7f49d02fc02a]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x26b) [0x55d5d510b43b]
>  6: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x642) [0x55d5d4ade2b2]
>  7: (OSD::load_pgs()+0x75a) [0x55d5d4a3383a]
>  8: (OSD::init()+0x2026) [0x55d5d4a3ec46]
>  9: (main()+0x2d6b) [0x55d5d49b193b]
>  10: (__libc_start_main()+0xf0) [0x7f49d02e5830]
>  11: (_start()+0x29) [0x55d5d49f28c9]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 rbd_mirror
>    0/ 5 rbd_replay
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>    0/ 5 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    1/ 5 paxos
>    0/ 5 tp
>    1/ 5 auth
>    1/ 5 crypto
>    1/ 1 finisher
>    1/ 5 heartbeatmap
>    1/ 5 perfcounter
>    1/ 5 rgw
>    1/10 civetweb
>    1/ 5 javaclient
>    1/ 5 asok
>    1/ 1 throttle
>    0/ 0 refs
>    1/ 5 xio
>    1/ 5 compressor
>    1/ 5 newstore
>    1/ 5 bluestore
>    1/ 5 bluefs
>    1/ 3 bdev
>    1/ 5 kstore
>    4/ 5 rocksdb
>    4/ 5 leveldb
>    1/ 5 kinetic
>    1/ 5 fuse
>   -2/-2 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent 10000
>   max_new 1000
>   log_file /var/log/ceph/ceph-osd.3.log
> -
>
>
> On 25/10/17 00:42, Christian Wuerdig wrote:
>
> From which version of ceph to which other version of ceph did you
> upgrade? Can you provide logs from crashing OSDs? The degraded object
> percentage being larger than 100% has been reported before
> (https://www.spinics.net/lists/ceph-users/msg39519.html) and looks
> like it's been fixed a week or so ago:
> http://tracker.ceph.com/issues/21803
>
> On Mon, Oct 23, 2017 at 5:10 AM, Gonzalo Aguilar Delgado
> <gaguilar@xxxxxxxxxxxxxxxxxx> wrote:
>
> Hello,
>
> Since we upgraded the ceph cluster we are facing a lot of problems. Most of
> them are due to OSDs crashing. What can cause this?
>
>
> This morning I woke up with this message:
>
>
> root@red-compute:~# ceph -w
>     cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
>      health HEALTH_ERR
>             1 pgs are stuck inactive for more than 300 seconds
>             7 pgs inconsistent
>             1 pgs stale
>             1 pgs stuck stale
>             recovery 20266198323167232/287940 objects degraded (7038340738753.641%)
>             37154696925806626 scrub errors
>             too many PGs per OSD (305 > max 300)
>      monmap e12: 2 mons at {blue-compute=172.16.0.119:6789/0,red-compute=172.16.0.100:6789/0}
>             election epoch 4986, quorum 0,1 red-compute,blue-compute
>       fsmap e913: 1/1/1 up {0=blue-compute=up:active}
>      osdmap e8096: 5 osds: 5 up, 5 in
>             flags require_jewel_osds
>       pgmap v68755349: 764 pgs, 6 pools, 558 GB data, 140 kobjects
>             1119 GB used, 3060 GB / 4179 GB avail
>             20266198323167232/287940 objects degraded (7038340738753.641%)
>                  756 active+clean
>                    7 active+clean+inconsistent
>                    1 stale+active+clean
>   client io 1630 B/s rd, 552 kB/s wr, 0 op/s rd, 64 op/s wr
>
> 2017-10-22 18:10:13.000812 mon.0 [INF] pgmap v68755348: 764 pgs: 7 active+clean+inconsistent,
> 756 active+clean, 1 stale+active+clean; 558 GB data, 1119 GB used, 3060 GB / 4179 GB avail;
> 1641 B/s rd, 229 kB/s wr, 39 op/s; 20266198323167232/287940 objects degraded (7038340738753.641%)
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
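For what it's worth, the point-release upgrade suggested at the top usually amounts to
something like the following on a package-based Ubuntu 16.04 install. This is only a
sketch: it assumes the stock systemd units and that your apt source actually carries
10.2.10 (you may need the upstream download.ceph.com repository for that), so adjust the
package list and repo setup to match your cluster before running anything.

    # check what each daemon is currently running
    ceph -s
    ceph tell osd.* version

    # keep CRUSH from marking OSDs out and rebalancing while daemons restart
    ceph osd set noout

    # on each node (monitor nodes first, then OSD nodes), pull in the 10.2.10 packages
    apt-get update
    apt-get install --only-upgrade ceph ceph-base ceph-common ceph-mon ceph-osd

    # restart daemons one node at a time, waiting for the cluster to settle in between
    systemctl restart ceph-mon.target
    systemctl restart ceph-osd.target

    # once every daemon reports 10.2.10, allow rebalancing again
    ceph osd unset noout

The usual ordering is monitors first, then OSDs, one node at a time, so the cluster keeps
quorum and PGs stay available throughout the upgrade.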