Hey folks,

I'm staring at a problem that I have found no solution for and which is causing major issues. We've had a PG go down with the first 3 OSDs all crashing and coming back, only to crash again with the following error in their logs:

    -1> 2017-08-22 17:27:50.961633 7f4af4057700 -1 osd.1290 pg_epoch: 72946 pg[1.138s0( v 72946'430011 (62760'421568,72946'430011] local-les=72945 n=22918 ec=764 les/c/f 72945/72881/0 72942/72944/72944) [1290,927,672,456,177,1094,194,1513,236,302,1326]/[1290,927,672,456,177,1094,194,2147483647,236,302,1326] r=0 lpr=72944 pi=72880-72943/24 bft=1513(7) crt=72946'430011 lcod 72889'430010 mlcod 72889'430010 active+undersized+degraded+remapped+backfilling] recover_replicas: object added to missing set for backfill, but is not in recovering, error!
     0> 2017-08-22 17:27:50.965861 7f4af4057700 -1 *** Caught signal (Aborted) **
     in thread 7f4af4057700 thread_name:tp_osd_tp

This has been going on since the weekend; we were seeing a different error message before upgrading from 11.2.0 to 11.2.1. The pool is running EC 8+3. The OSDs crash with that error only to be restarted by systemd and fail again in exactly the same way. Eventually systemd gives up, the mon_osd_down_out_interval expires, and the PG just stays down+remapped while the others recover and go active+clean.

Can anybody help with this type of problem?

Best regards,

George Vasilakakos
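P.S. In case the details help, this is roughly how I've been inspecting and containing the situation (PG and OSD IDs as above; the pool and profile names are placeholders, substitute your own):

    # Dump the full state of the stuck PG, including the acting set --
    # the 2147483647 slot in the log above is CRUSH_ITEM_NONE, i.e. a
    # shard with no OSD mapped to it
    ceph pg 1.138 query

    # Confirm the pool's erasure-code profile (k=8, m=3 in our case)
    ceph osd pool get <poolname> erasure_code_profile
    ceph osd erasure-code-profile get <profilename>

    # Keep a crash-looping OSD down while debugging, and stop the
    # cluster from marking it out in the meantime
    systemctl stop ceph-osd@1290.service
    ceph osd set noout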