On 10/16/2017 02:02 PM, Dejan Lesjak wrote:
> Hi,
>
> During rather high load and rebalancing, a couple of our OSDs crashed
> and they fail to start. This is from the log:
>
>     -2> 2017-10-16 13:27:50.235204 7f5e4c3bae80  0 osd.1 442123 load_pgs
> opened 370 pgs
>     -1> 2017-10-16 13:27:50.239175 7f5e4c3bae80  1 osd.1 442123
> build_past_intervals_parallel over 439159-439159
>      0> 2017-10-16 13:27:50.261883 7f5e4c3bae80 -1
> /var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc:
> In function 'void OSD::build_past_intervals_parallel()' thread
> 7f5e4c3bae80 time 2017-10-16 13:27:50.260062
> /var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc:
> 4177: FAILED assert(p.same_interval_since)
>
>  ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x102) [0x55e4caa18592]
>  2: (OSD::build_past_intervals_parallel()+0x1d7b) [0x55e4ca453e8b]
>  3: (OSD::load_pgs()+0x14cb) [0x55e4ca45564b]
>  4: (OSD::init()+0x2227) [0x55e4ca467327]
>  5: (main()+0x2d5a) [0x55e4ca379b1a]
>  6: (__libc_start_main()+0xf1) [0x7f5e48ee35d1]
>  7: (_start()+0x2a) [0x55e4ca4039aa]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
> Does anybody know how to fix or further debug this?

Bumped logging to 10 and posted the log to https://pastebin.com/raw/StTeYWRt

From "10.1fces2 needs 439159-0" it seems the OSD (osd.1) gets stuck at pg
10.1fce. Yet the pg map doesn't show osd.1 for this pg:

# ceph pg map 10.1fce
osdmap e443665 pg 10.1fce (10.1fce) -> up [110,213,132,182] acting [110,213,132,182]
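If that stale copy of 10.1fce on osd.1 is what trips the assert, one thing I'm
considering (just a sketch, not something I've run yet: it assumes the default
data path /var/lib/ceph/osd/ceph-1 and that the PG is healthy on its current
acting set [110,213,132,182]; osd.1 must not be running while
ceph-objectstore-tool is used) is to export that PG from osd.1's store as a
backup and then remove it, so the OSD can start without it:

# systemctl stop ceph-osd@1
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --pgid 10.1fce --op info
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --pgid 10.1fce \
      --op export --file /root/pg.10.1fce.export
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --pgid 10.1fce --op remove

(Some builds also want --force on the remove op, and filestore OSDs additionally
need --journal-path.) Is that a reasonable approach, or is there a safer way?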