Re: Cluster crash - FAILED assert(interval.last > last)

Zdenek Janda <zdenek.janda@xxxxxxxxxxxxxxxx> · Thu, 11 Jan 2018 11:20:32 +0100

Hi,

I updated the issue at http://tracker.ceph.com/issues/21142#note-5 with the
last 10000 lines of strace output before the ABRT. The crash ends with:

     0.002429 pread64(22,
"\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\354:\0\0"...,
12288, 908492996608) = 12288
     0.007869 pread64(22,
"\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\355:\0\0"...,
12288, 908493324288) = 12288
     0.004220 pread64(22,
"\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\356:\0\0"...,
12288, 908499615744) = 12288
     0.009143 pread64(22,
"\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\357:\0\0"...,
12288, 908500926464) = 12288
     0.010802 write(2, "/build/ceph-12.2.1/src/osd/osd_t"...,
275/build/ceph-12.2.1/src/osd/osd_types.cc: In function 'virtual void
pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)'
thread 7fb85e234e00 time 2018-01-11 11:02:54.783628
/build/ceph-12.2.1/src/osd/osd_types.cc: 3205: FAILED
assert(interval.last > last)
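
To make the failing check easier to reason about, here is a minimal sketch
of the invariant that the assert appears to enforce: each interval added to
the compact past-intervals record must end at a later epoch than anything
already recorded. This is a toy Python model, not the Ceph implementation;
the class name CompactIntervals, its fields, and the epoch numbers are
hypothetical and chosen only for illustration.

```python
class CompactIntervals:
    """Toy model of a compact past-intervals record (hypothetical, not Ceph code)."""

    def __init__(self):
        self.first = 0  # first epoch covered; 0 means nothing recorded yet
        self.last = 0   # last epoch covered so far

    def add_interval(self, first, last):
        """Record an interval given by its first and last epoch."""
        if self.first == 0:
            self.first = first
        # Models the failing check: a new interval must end strictly later
        # than the latest epoch already recorded.
        if not last > self.last:
            raise AssertionError(
                "FAILED assert(interval.last > last): %d <= %d" % (last, self.last)
            )
        self.last = last


def violates_invariant(intervals):
    """Return True if feeding the (first, last) pairs in order trips the check."""
    rep = CompactIntervals()
    try:
        for first, last in intervals:
            rep.add_interval(first, last)
    except AssertionError:
        return True
    return False
```

In this model, violates_invariant([(10, 20), (21, 30)]) is False, while a
repeated or out-of-order interval such as [(10, 20), (21, 30), (25, 30)]
trips the check. Whether a duplicated or misordered interval is what the
OSDs actually hit here is an assumption and has not been established.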

Any suggestions are welcome; we need to understand the mechanism that caused
this.

Thanks
Zdenek Janda

On 11.1.2018 10:48, Josef Zelenka wrote:
> I have posted logs and strace output from our OSDs, with details, to a
> ticket in the Ceph bug tracker: http://tracker.ceph.com/issues/21142. It
> shows exactly where the OSDs crash, which may help if someone decides to
> debug it.
> 
> JZ
> 
> 
> On 10/01/18 22:05, Josef Zelenka wrote:
>>
>> Hi, today we had a disastrous crash. We are running a 3-node cluster
>> with 24 OSDs in total (8 per node), with SSDs for the block DB and HDDs
>> for the BlueStore data. This cluster is used as a radosgw backend,
>> storing a large number of thumbnails for a file hosting site, around
>> 110m files in total. We were adding an interface to the nodes, which
>> required a restart, but after restarting one of the nodes, a lot of the
>> OSDs were kicked out of the cluster and rgw stopped working. We have a
>> lot of PGs down and unfound at the moment. Most OSDs can't be started
>> (some can, which is a mystery) and fail with this error: FAILED
>> assert(interval.last > last). They just restart periodically. So far,
>> the cluster is broken and we can't seem to bring it back up. We tried
>> running fsck on the OSDs via ceph-objectstore-tool, but it did not
>> help. The root of all this seems to be the FAILED assert(interval.last
>> > last) error, but I can't find any information about it or how to fix
>> it. Has anyone here encountered it as well? We're running Luminous on
>> Ubuntu 16.04.
>>
>> Thanks
>>
>> Josef Zelenka
>>
>> Cloudevelops
>>
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
