I take my hat off to you, well done for solving that!!!

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Zdenek Janda
> Sent: 11 January 2018 13:01
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Cluster crash - FAILED assert(interval.last > last)
>
> Hi,
> we have restored the damaged OSDs that would not start after being hit by
> this bug. The detailed steps are at
> http://tracker.ceph.com/issues/21142#note-9 for reference; should anybody
> else run into this, they should fix it for you.
> Thanks
> Zdenek Janda
>
>
> On 11.1.2018 11:40, Zdenek Janda wrote:
> > Hi,
> > I have succeeded in identifying the faulty PG:
> >
> >  -3450> 2018-01-11 11:32:20.015658 7f066e2a3e00 10 osd.15 15340 12.62d needs 13939-15333
> >  -3449> 2018-01-11 11:32:20.019405 7f066e2a3e00  1 osd.15 15340 build_past_intervals_parallel over 13939-15333
> >  -3448> 2018-01-11 11:32:20.019436 7f066e2a3e00 10 osd.15 15340 build_past_intervals_parallel epoch 13939
> >  -3447> 2018-01-11 11:32:20.019447 7f066e2a3e00 20 osd.15 0 get_map 13939 - loading and decoding 0x55d39deefb80
> >  -3446> 2018-01-11 11:32:20.249771 7f066e2a3e00 10 osd.15 0 add_map_bl 13939 27475 bytes
> >  -3445> 2018-01-11 11:32:20.250392 7f066e2a3e00 10 osd.15 15340 build_past_intervals_parallel epoch 13939 pg 12.62d first map, acting [21,9] up [21,9], same_interval_since = 13939
> >  -3444> 2018-01-11 11:32:20.250505 7f066e2a3e00 10 osd.15 15340 build_past_intervals_parallel epoch 13940
> >  -3443> 2018-01-11 11:32:20.250529 7f066e2a3e00 20 osd.15 0 get_map 13940 - loading and decoding 0x55d39deef800
> >  -3442> 2018-01-11 11:32:20.251883 7f066e2a3e00 10 osd.15 0 add_map_bl 13940 27475 bytes
> > ....
> >     -3> 2018-01-11 11:32:26.973843 7f066e2a3e00 10 osd.15 15340 build_past_intervals_parallel epoch 15087
> >     -2> 2018-01-11 11:32:26.973999 7f066e2a3e00 20 osd.15 0 get_map 15087 - loading and decoding 0x55d3f9e7e700
> >     -1> 2018-01-11 11:32:26.984286 7f066e2a3e00 10 osd.15 0 add_map_bl 15087 11409 bytes
> >      0> 2018-01-11 11:32:26.990595 7f066e2a3e00 -1 /build/ceph-12.2.1/src/osd/osd_types.cc: In function 'virtual void pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)' thread 7f066e2a3e00 time 2018-01-11 11:32:26.984716
> > /build/ceph-12.2.1/src/osd/osd_types.cc: 3205: FAILED assert(interval.last > last)
> >
> > Let's see what can be done about this PG.
> >
> > Thanks
> > Zdenek Janda
> >
> >
> > On 11.1.2018 11:20, Zdenek Janda wrote:
> >> Hi,
> >>
> >> updated the issue at http://tracker.ceph.com/issues/21142#note-5 with
> >> the last 10000 lines of strace before the ABRT.
> >> The crash ends with:
> >>
> >>      0.002429 pread64(22,
> >> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\354:\0\0"...,
> >> 12288, 908492996608) = 12288
> >>      0.007869 pread64(22,
> >> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\355:\0\0"...,
> >> 12288, 908493324288) = 12288
> >>      0.004220 pread64(22,
> >> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\356:\0\0"...,
> >> 12288, 908499615744) = 12288
> >>      0.009143 pread64(22,
> >> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\357:\0\0"...,
> >> 12288, 908500926464) = 12288
> >>      0.010802 write(2, "/build/ceph-12.2.1/src/osd/osd_t"..., 275
> >> /build/ceph-12.2.1/src/osd/osd_types.cc: In function 'virtual void
> >> pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)'
> >> thread 7fb85e234e00 time 2018-01-11 11:02:54.783628
> >> /build/ceph-12.2.1/src/osd/osd_types.cc: 3205: FAILED assert(interval.last > last)
> >>
> >> Any suggestions are welcome; we need to understand the mechanism behind
> >> why this happened.
> >>
> >> Thanks
> >> Zdenek Janda
> >>
> >>
> >> On 11.1.2018 10:48, Josef Zelenka wrote:
> >>> I have posted logs/strace from our OSDs with details to a ticket in
> >>> the Ceph bug tracker - see http://tracker.ceph.com/issues/21142.
> >>> You can see where exactly the OSDs crash, etc.; this may be of help if
> >>> someone decides to debug it.
> >>>
> >>> JZ
> >>>
> >>>
> >>> On 10/01/18 22:05, Josef Zelenka wrote:
> >>>>
> >>>> Hi, today we had a disastrous crash - we are running a 3-node cluster
> >>>> with 24 OSDs in total (8 per node), with SSDs for the block DB and
> >>>> HDDs for the BlueStore data. This cluster is used as a radosgw
> >>>> backend, storing a large number of thumbnails for a file-hosting
> >>>> site - around 110m files in total. We were adding an interface to the
> >>>> nodes, which required a restart, but after restarting one of the
> >>>> nodes, a lot of the OSDs were kicked out of the cluster and rgw
> >>>> stopped working. We have a lot of PGs down and unfound at the moment.
> >>>> The OSDs can't be started (aside from some, which is a mystery) -
> >>>> they fail with FAILED assert(interval.last > last) and just
> >>>> periodically restart. So far the cluster is broken and we can't seem
> >>>> to bring it back up. We tried fscking the OSDs via
> >>>> ceph-objectstore-tool, but it was no good. The root of all this seems
> >>>> to be the FAILED assert(interval.last > last) error, however I can't
> >>>> find any info regarding it or how to fix it. Did someone here also
> >>>> encounter it? We're running Luminous on Ubuntu 16.04.
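
For anyone who finds this thread in the archives: as far as I can tell, the
assert in the quoted traces is a sanity check on the PG's past-intervals
record. Reading the assertion itself, pi_compact_rep::add_interval() appears
to expect intervals to be appended in epoch order, each one ending strictly
after the last epoch already recorded, so an out-of-order or overlapping
past-intervals record trips assert(interval.last > last) while
build_past_intervals_parallel replays the maps at OSD start. Below is a
minimal C++ sketch of that invariant - simplified stand-ins only, not the
actual Ceph classes, and the epoch numbers are made up:

  // Illustration only: hypothetical stand-ins for PastIntervals::pg_interval_t
  // and pi_compact_rep, showing the invariant behind
  // "FAILED assert(interval.last > last)" at osd_types.cc:3205.
  #include <cassert>
  #include <cstdint>
  #include <vector>

  using epoch_t = std::uint32_t;

  struct pg_interval_t {
    epoch_t first;   // first OSDMap epoch covered by this past interval
    epoch_t last;    // last OSDMap epoch covered by this past interval
  };

  class compact_intervals {           // stand-in for pi_compact_rep
    epoch_t last = 0;                 // last epoch recorded so far
    std::vector<pg_interval_t> intervals;
  public:
    void add_interval(const pg_interval_t &interval) {
      // Each appended interval must end strictly after everything recorded
      // before it; this is the check that aborts the OSD.
      assert(interval.last > last);
      intervals.push_back(interval);
      last = interval.last;
    }
  };

  int main() {
    compact_intervals pi;
    pi.add_interval({13939, 13940});  // ok: 13940 > 0
    pi.add_interval({13941, 15086});  // ok: 15086 > 13940
    pi.add_interval({15000, 15086});  // aborts: 15086 is not > 15086
  }

Built with e.g. g++ -std=c++11, the third call aborts with the same kind of
assertion failure, which looks like what the OSD hits around epoch 15087 in
the log above - presumably the interval it reconstructs at that point does
not extend past what was already recorded.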
> >>>>
> >>>> Thanks
> >>>>
> >>>> Josef Zelenka
> >>>>
> >>>> Cloudevelops

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com