Hi,
we have restored the damaged OSDs that were not starting after the bug caused by this issue. Detailed steps are at http://tracker.ceph.com/issues/21142#note-9 for reference; should anybody hit this, it should fix it for you.

Thanks
Zdenek Janda

On 11.1.2018 11:40, Zdenek Janda wrote:
> Hi,
> I have succeeded in identifying the faulty PG:
>
> -3450> 2018-01-11 11:32:20.015658 7f066e2a3e00 10 osd.15 15340 12.62d needs 13939-15333
> -3449> 2018-01-11 11:32:20.019405 7f066e2a3e00 1 osd.15 15340 build_past_intervals_parallel over 13939-15333
> -3448> 2018-01-11 11:32:20.019436 7f066e2a3e00 10 osd.15 15340 build_past_intervals_parallel epoch 13939
> -3447> 2018-01-11 11:32:20.019447 7f066e2a3e00 20 osd.15 0 get_map 13939 - loading and decoding 0x55d39deefb80
> -3446> 2018-01-11 11:32:20.249771 7f066e2a3e00 10 osd.15 0 add_map_bl 13939 27475 bytes
> -3445> 2018-01-11 11:32:20.250392 7f066e2a3e00 10 osd.15 15340 build_past_intervals_parallel epoch 13939 pg 12.62d first map, acting [21,9] up [21,9], same_interval_since = 13939
> -3444> 2018-01-11 11:32:20.250505 7f066e2a3e00 10 osd.15 15340 build_past_intervals_parallel epoch 13940
> -3443> 2018-01-11 11:32:20.250529 7f066e2a3e00 20 osd.15 0 get_map 13940 - loading and decoding 0x55d39deef800
> -3442> 2018-01-11 11:32:20.251883 7f066e2a3e00 10 osd.15 0 add_map_bl 13940 27475 bytes
> ....
> -3> 2018-01-11 11:32:26.973843 7f066e2a3e00 10 osd.15 15340 build_past_intervals_parallel epoch 15087
> -2> 2018-01-11 11:32:26.973999 7f066e2a3e00 20 osd.15 0 get_map 15087 - loading and decoding 0x55d3f9e7e700
> -1> 2018-01-11 11:32:26.984286 7f066e2a3e00 10 osd.15 0 add_map_bl 15087 11409 bytes
> 0> 2018-01-11 11:32:26.990595 7f066e2a3e00 -1 /build/ceph-12.2.1/src/osd/osd_types.cc: In function 'virtual void pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)' thread 7f066e2a3e00 time 2018-01-11 11:32:26.984716
> /build/ceph-12.2.1/src/osd/osd_types.cc: 3205: FAILED assert(interval.last > last)
>
> Let's see what can be done about this PG.
>
> Thanks
> Zdenek Janda
>
> On 11.1.2018 11:20, Zdenek Janda wrote:
>> Hi,
>>
>> updated the issue at http://tracker.ceph.com/issues/21142#note-5 with the last 10000 lines of strace before the ABRT. The crash ends with:
>>
>> 0.002429 pread64(22, "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\354:\0\0"..., 12288, 908492996608) = 12288
>> 0.007869 pread64(22, "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\355:\0\0"..., 12288, 908493324288) = 12288
>> 0.004220 pread64(22, "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\356:\0\0"..., 12288, 908499615744) = 12288
>> 0.009143 pread64(22, "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\357:\0\0"..., 12288, 908500926464) = 12288
>> 0.010802 write(2, "/build/ceph-12.2.1/src/osd/osd_t"..., 275/build/ceph-12.2.1/src/osd/osd_types.cc: In function 'virtual void pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)' thread 7fb85e234e00 time 2018-01-11 11:02:54.783628
>> /build/ceph-12.2.1/src/osd/osd_types.cc: 3205: FAILED assert(interval.last > last)
>>
>> Any suggestions are welcome; we need to understand the mechanism of why this happened.
>>
>> Thanks
>> Zdenek Janda
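
For reference, a trace like the excerpt above can be captured by stopping the systemd-managed daemon and running the OSD in the foreground under strace. This is only a minimal sketch, assuming osd.15, the default cluster name and an illustrative output path (the timestamps in the excerpt look like strace's -r relative format, but that is an assumption):

  systemctl stop ceph-osd@15
  # -f follows the OSD's threads, -r prefixes each syscall with a
  # relative timestamp, -o writes the trace to a file
  strace -f -r -o /tmp/osd.15.strace \
      /usr/bin/ceph-osd -f --cluster ceph --id 15 --setuser ceph --setgroup ceph

The daemon then aborts on the same assert, and the tail of the trace file shows the final reads plus the write of the assert message to stderr, as quoted above.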
>>
>> On 11.1.2018 10:48, Josef Zelenka wrote:
>>> I have posted logs/strace from our OSDs with details to a ticket in the ceph bug tracker - see here: http://tracker.ceph.com/issues/21142. You can see exactly where the OSDs crash etc.; this can be of help if someone decides to debug it.
>>>
>>> JZ
>>>
>>> On 10/01/18 22:05, Josef Zelenka wrote:
>>>> Hi, today we had a disastrous crash - we are running a 3-node cluster with 24 OSDs in total (8 per node), with SSDs for the block DB and HDDs for the bluestore data. This cluster is used as a radosgw backend, storing a big number of thumbnails for a file hosting site - around 110M files in total. We were adding an interface to the nodes, which required a restart, but after restarting one of the nodes a lot of the OSDs were kicked out of the cluster and rgw stopped working. We have a lot of PGs down and unfound atm. The OSDs can't be started (aside from some - that's a mystery) with this error - FAILED assert(interval.last > last) - they just periodically restart. So far the cluster is broken and we can't seem to bring it back up. We tried fscking the OSDs via the ceph objectstore tool, but it was no good. The root of all this seems to be the FAILED assert(interval.last > last) error, however I can't find any info regarding it or how to fix it. Did someone here also encounter it? We're running luminous on Ubuntu 16.04.
>>>>
>>>> Thanks
>>>>
>>>> Josef Zelenka
>>>>
>>>> Cloudevelops
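
The assert ('interval.last > last' in pi_compact_rep::add_interval) fires while the OSD rebuilds past intervals at startup, which suggests the per-PG past_intervals metadata on disk ended up internally inconsistent: each recorded interval must end at a strictly later epoch than the one before it. The actual recovery steps are in http://tracker.ceph.com/issues/21142#note-9 (see the top of the thread); purely as an illustration of the kind of ceph-objectstore-tool workflow involved - the OSD and PG ids below come from the log excerpt earlier in the thread, some releases require extra flags such as --force for --op remove, and this is a sketch rather than the verified procedure from the tracker note:

  # with the OSD stopped, check the store and list the PGs it holds
  systemctl stop ceph-osd@15
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-15 --op fsck
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-15 --op list-pgs

  # back up the suspect PG before touching anything
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-15 \
      --pgid 12.62d --op export --file /root/pg.12.62d.export

  # remove the copy with the bad metadata so the OSD can start again;
  # the surviving replicas then recover/backfill the PG
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-15 \
      --pgid 12.62d --op remove

  systemctl start ceph-osd@15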