I take my hat off to you, well done for solving that!!!

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Zdenek Janda
> Sent: 11 January 2018 13:01
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Cluster crash - FAILED assert(interval.last > last)
>
> Hi,
> we have restored the damaged OSDs that would not start after being hit by
> this bug. The detailed steps are at
> http://tracker.ceph.com/issues/21142#note-9 for reference; should anybody
> else run into this, they should fix it for you.
> Thanks
> Zdenek Janda
>
>
> On 11.1.2018 11:40, Zdenek Janda wrote:
> > Hi,
> > I have succeeded in identifying the faulty PG:
> >
> >  -3450> 2018-01-11 11:32:20.015658 7f066e2a3e00 10 osd.15 15340 12.62d needs 13939-15333
> >  -3449> 2018-01-11 11:32:20.019405 7f066e2a3e00  1 osd.15 15340 build_past_intervals_parallel over 13939-15333
> >  -3448> 2018-01-11 11:32:20.019436 7f066e2a3e00 10 osd.15 15340 build_past_intervals_parallel epoch 13939
> >  -3447> 2018-01-11 11:32:20.019447 7f066e2a3e00 20 osd.15 0 get_map 13939 - loading and decoding 0x55d39deefb80
> >  -3446> 2018-01-11 11:32:20.249771 7f066e2a3e00 10 osd.15 0 add_map_bl 13939 27475 bytes
> >  -3445> 2018-01-11 11:32:20.250392 7f066e2a3e00 10 osd.15 15340 build_past_intervals_parallel epoch 13939 pg 12.62d first map, acting [21,9] up [21,9], same_interval_since = 13939
> >  -3444> 2018-01-11 11:32:20.250505 7f066e2a3e00 10 osd.15 15340 build_past_intervals_parallel epoch 13940
> >  -3443> 2018-01-11 11:32:20.250529 7f066e2a3e00 20 osd.15 0 get_map 13940 - loading and decoding 0x55d39deef800
> >  -3442> 2018-01-11 11:32:20.251883 7f066e2a3e00 10 osd.15 0 add_map_bl 13940 27475 bytes
> > ....
> >     -3> 2018-01-11 11:32:26.973843 7f066e2a3e00 10 osd.15 15340 build_past_intervals_parallel epoch 15087
> >     -2> 2018-01-11 11:32:26.973999 7f066e2a3e00 20 osd.15 0 get_map 15087 - loading and decoding 0x55d3f9e7e700
> >     -1> 2018-01-11 11:32:26.984286 7f066e2a3e00 10 osd.15 0 add_map_bl 15087 11409 bytes
> >      0> 2018-01-11 11:32:26.990595 7f066e2a3e00 -1 /build/ceph-12.2.1/src/osd/osd_types.cc: In function 'virtual void pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)' thread 7f066e2a3e00 time 2018-01-11 11:32:26.984716
> > /build/ceph-12.2.1/src/osd/osd_types.cc: 3205: FAILED assert(interval.last > last)
> >
> > Let's see what can be done about this PG.
> >
> > Thanks
> > Zdenek Janda
> >
> >
> > On 11.1.2018 11:20, Zdenek Janda wrote:
> >> Hi,
> >>
> >> updated the issue at http://tracker.ceph.com/issues/21142#note-5 with
> >> the last 10000 lines of strace before the ABRT.
> >> The crash ends with:
> >>
> >>      0.002429 pread64(22,
> >> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\354:\0\0"...,
> >> 12288, 908492996608) = 12288
> >>      0.007869 pread64(22,
> >> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\355:\0\0"...,
> >> 12288, 908493324288) = 12288
> >>      0.004220 pread64(22,
> >> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\356:\0\0"...,
> >> 12288, 908499615744) = 12288
> >>      0.009143 pread64(22,
> >> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\357:\0\0"...,
> >> 12288, 908500926464) = 12288
> >>      0.010802 write(2, "/build/ceph-12.2.1/src/osd/osd_t"..., 275
> >> /build/ceph-12.2.1/src/osd/osd_types.cc: In function 'virtual void
> >> pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)'
> >> thread 7fb85e234e00 time 2018-01-11 11:02:54.783628
> >> /build/ceph-12.2.1/src/osd/osd_types.cc: 3205: FAILED assert(interval.last > last)
> >>
> >> Any suggestions are welcome; we need to understand the mechanism behind
> >> why this happened.
> >>
> >> Thanks
> >> Zdenek Janda
> >>
> >>
> >> On 11.1.2018 10:48, Josef Zelenka wrote:
> >>> I have posted logs/strace from our OSDs with details to a ticket in
> >>> the Ceph bug tracker - see http://tracker.ceph.com/issues/21142.
> >>> You can see where exactly the OSDs crash, etc.; this may be of help if
> >>> someone decides to debug it.
> >>>
> >>> JZ
> >>>
> >>>
> >>> On 10/01/18 22:05, Josef Zelenka wrote:
> >>>>
> >>>> Hi, today we had a disastrous crash - we are running a 3-node cluster
> >>>> with 24 OSDs in total (8 per node), with SSDs for the block DB and
> >>>> HDDs for the BlueStore data. This cluster is used as a radosgw
> >>>> backend, storing a large number of thumbnails for a file-hosting
> >>>> site - around 110m files in total. We were adding an interface to the
> >>>> nodes, which required a restart, but after restarting one of the
> >>>> nodes, a lot of the OSDs were kicked out of the cluster and rgw
> >>>> stopped working. We have a lot of PGs down and unfound at the moment.
> >>>> The OSDs can't be started (aside from some, which is a mystery) -
> >>>> they fail with FAILED assert(interval.last > last) and just
> >>>> periodically restart. So far the cluster is broken and we can't seem
> >>>> to bring it back up. We tried fscking the OSDs via
> >>>> ceph-objectstore-tool, but it was no good. The root of all this seems
> >>>> to be the FAILED assert(interval.last > last) error, however I can't
> >>>> find any info regarding it or how to fix it. Did someone here also
> >>>> encounter it? We're running Luminous on Ubuntu 16.04.
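
For anyone who finds this thread in the archives: as far as I can tell, the
assert in the quoted traces is a sanity check on the PG's past-intervals
record. Reading the assertion itself, pi_compact_rep::add_interval() appears
to expect intervals to be appended in epoch order, each one ending strictly
after the last epoch already recorded, so an out-of-order or overlapping
past-intervals record trips assert(interval.last > last) while
build_past_intervals_parallel replays the maps at OSD start. Below is a
minimal C++ sketch of that invariant - simplified stand-ins only, not the
actual Ceph classes, and the epoch numbers are made up:

  // Illustration only: hypothetical stand-ins for PastIntervals::pg_interval_t
  // and pi_compact_rep, showing the invariant behind
  // "FAILED assert(interval.last > last)" at osd_types.cc:3205.
  #include <cassert>
  #include <cstdint>
  #include <vector>

  using epoch_t = std::uint32_t;

  struct pg_interval_t {
    epoch_t first;   // first OSDMap epoch covered by this past interval
    epoch_t last;    // last OSDMap epoch covered by this past interval
  };

  class compact_intervals {           // stand-in for pi_compact_rep
    epoch_t last = 0;                 // last epoch recorded so far
    std::vector<pg_interval_t> intervals;
  public:
    void add_interval(const pg_interval_t &interval) {
      // Each appended interval must end strictly after everything recorded
      // before it; this is the check that aborts the OSD.
      assert(interval.last > last);
      intervals.push_back(interval);
      last = interval.last;
    }
  };

  int main() {
    compact_intervals pi;
    pi.add_interval({13939, 13940});  // ok: 13940 > 0
    pi.add_interval({13941, 15086});  // ok: 15086 > 13940
    pi.add_interval({15000, 15086});  // aborts: 15086 is not > 15086
  }

Built with e.g. g++ -std=c++11, the third call aborts with the same kind of
assertion failure, which looks like what the OSD hits around epoch 15087 in
the log above - presumably the interval it reconstructs at that point does
not extend past what was already recorded.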
> >>>>
> >>>> Thanks
> >>>>
> >>>> Josef Zelenka
> >>>>
> >>>> Cloudevelops

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com