Re: Luminous OSD crashes every few seconds: FAILED assert(0 == "past_interval end mismatch")

It seems I got around this issue with the following process.

1. I noted from the error that the pg causing the problem was 2.621.

2. I ran "ceph pg 2.621 query" and saw that the pg had nothing
whatsoever to do with the affected OSD.

3. I looked in the /var/lib/ceph/osd/ceph-14/current directory for
things related to 2.621.  I found 2.621_head, which had a zero-length
file inside called __head_00000621__2, and an empty directory called
2.621_TEMP.

4. I blew away both directories (rough commands below).
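
For reference, this is roughly the sequence of commands involved.  The
OSD id (14), the pg (2.621), and the paths are from my setup; the
systemd unit name is an assumption based on a stock Luminous install,
so adjust as needed:

    # see which OSDs the problem pg actually maps to
    ceph pg 2.621 query

    # make sure the crashing OSD is stopped (it was flapping anyway)
    systemctl stop ceph-osd@14

    # remove the stray pg directories from the FileStore data dir
    rm -rf /var/lib/ceph/osd/ceph-14/current/2.621_head
    rm -rf /var/lib/ceph/osd/ceph-14/current/2.621_TEMP

    # start the OSD again
    systemctl start ceph-osd@14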

This allowed the affected OSD to start and, after several minutes, it
rejoined the cluster where its peers seemed very happy to see it
again.
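
In case it helps anyone hitting the same assert, watching the OSD
rejoin only needs the standard status commands:

    # overall cluster state and recovery/backfill progress
    ceph -s

    # confirm the OSD is back up and in
    ceph osd tree

    # re-check the pg named in the assert
    ceph pg 2.621 query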

I'm not sure whether this works generally, whether it was specific to
this case, or whether it's no fix at all and the cluster will eat
itself overnight.  But so far, so good!

Thanks!

On Wed, Aug 1, 2018 at 3:42 PM, J David <j.david.lists@xxxxxxxxx> wrote:
> Hello all,
>
> On Luminous 12.2.7, during the course of recovering from a failed OSD,
> one of the other OSDs started repeatedly crashing every few seconds
> with an assertion failure:
>
> 2018-08-01 12:17:20.584350 7fb50eded700 -1 log_channel(cluster) log
> [ERR] : 2.621 past_interal bound [19300,21449) end does not match
> required [21374,21447) end
> /build/ceph-12.2.7/src/osd/PG.cc: In function 'void
> PG::check_past_interval_bounds() const' thread 7fb50eded700 time
> 2018-08-01 12:17:20.584367
> /build/ceph-12.2.7/src/osd/PG.cc: 847: FAILED assert(0 ==
> "past_interval end mismatch")
>
> The console output of a run of this OSD is here:
>
> https://pastebin.com/WSjsVwVu
>
> The last 512k worth of the log file for this OSD is here:
>
> https://pastebin.com/rYQkMatA
>
> Currently I have "debug osd = 5/5" in ceph.conf, but if other values
> would shed useful light, this problem is easy to reproduce.
>
> There are no disk errors or problems that I can see with the OSD that
> won’t stay running.
>
> Does anyone know what happened here, and whether it's recoverable?
>
> Thanks for any advice!
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



