> On 17. okt. 2017, at 00:23, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Mon, Oct 16, 2017 at 8:24 AM Dejan Lesjak <dejan.lesjak@xxxxxx> wrote:
> On 10/16/2017 02:02 PM, Dejan Lesjak wrote:
> > Hi,
> >
> > During rather high load and rebalancing, a couple of our OSDs crashed
> > and they fail to start. This is from the log:
> >
> > -2> 2017-10-16 13:27:50.235204 7f5e4c3bae80 0 osd.1 442123 load_pgs
> > opened 370 pgs
> > -1> 2017-10-16 13:27:50.239175 7f5e4c3bae80 1 osd.1 442123
> > build_past_intervals_parallel over 439159-439159
> > 0> 2017-10-16 13:27:50.261883 7f5e4c3bae80 -1
> > /var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc:
> > In function 'void OSD::build_past_intervals_parallel()' thread
> > 7f5e4c3bae80 time 2017-10-16 13:27:50.260062
> > /var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc:
> > 4177: FAILED assert(p.same_interval_since)
> >
> > ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous
> > (stable)
> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x102) [0x55e4caa18592]
> > 2: (OSD::build_past_intervals_parallel()+0x1d7b) [0x55e4ca453e8b]
> > 3: (OSD::load_pgs()+0x14cb) [0x55e4ca45564b]
> > 4: (OSD::init()+0x2227) [0x55e4ca467327]
> > 5: (main()+0x2d5a) [0x55e4ca379b1a]
> > 6: (__libc_start_main()+0xf1) [0x7f5e48ee35d1]
> > 7: (_start()+0x2a) [0x55e4ca4039aa]
> > NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> > needed to interpret this.
> > Does anybody know how to fix or further debug this?
>
> Bumped logging to 10 and posted log to https://pastebin.com/raw/StTeYWRt
> From "10.1fces2 needs 439159-0" it seems osd (osd.1) gets stuck at pg
> 10.1fce. Yet pg map doesn't show osd.1 for this pg:
>
> # ceph pg map 10.1fce
> osdmap e443665 pg 10.1fce (10.1fce) -> up [110,213,132,182] acting
> [110,213,132,182]
>
> Hmm, this is odd. What caused your rebalancing exactly? Can you turn on the OSD with debugging set to 20, and then upload the log file using ceph-post-file?
>
> The specific assert you're hitting here is supposed to cope with PGs that have been imported (via the ceph-objectstore-tool). But obviously something has gone wrong here.

It started when we bumped the number of PGs for a pool (from 2048 to 8192).
I’ve sent the log with ID 3a6dea4f-05d7-4c15-9f7e-2d95d99195ba

It actually seems similar to http://tracker.ceph.com/issues/21142 in that the pg found in the log seems empty when checked with ceph-objectstore-tool, and removing it allows the osd to start. At least on one osd; I’ve not tried that yet on all of the failed ones.

Dejan
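
P.S. For anyone else hitting this assert: the ceph-objectstore-tool steps are roughly the ones below. This is only a sketch: the data/journal paths and the pg id are examples for a filestore osd.1 and pg 10.1fce, so adjust them to your deployment (for bluestore, drop --journal-path), and the systemctl lines assume systemd units named ceph-osd@<id>; use whatever your init provides.

Stop the OSD, then check whether the PG really is empty:

  systemctl stop ceph-osd@1
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
      --journal-path /var/lib/ceph/osd/ceph-1/journal \
      --pgid 10.1fce --op info
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
      --journal-path /var/lib/ceph/osd/ceph-1/journal \
      --pgid 10.1fce --op list

Keep an export as a backup in case the PG turns out not to be empty after all, remove the stray copy (some builds also want --force on the remove), and start the OSD again:

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
      --journal-path /var/lib/ceph/osd/ceph-1/journal \
      --pgid 10.1fce --op export --file /root/pg-10.1fce.export
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
      --journal-path /var/lib/ceph/osd/ceph-1/journal \
      --pgid 10.1fce --op remove
  systemctl start ceph-osd@1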