Hey folks,

I'm staring at a problem that I have found no solution for and which is causing major issues. We've had a PG go down with the first 3 OSDs all crashing and coming back, only to crash again with the following error in their logs:

    -1> 2017-08-22 17:27:50.961633 7f4af4057700 -1 osd.1290 pg_epoch: 72946 pg[1.138s0( v 72946'430011 (62760'421568,72946'430011] local-les=72945 n=22918 ec=764 les/c/f 72945/72881/0 72942/72944/72944) [1290,927,672,456,177,1094,194,1513,236,302,1326]/[1290,927,672,456,177,1094,194,2147483647,236,302,1326] r=0 lpr=72944 pi=72880-72943/24 bft=1513(7) crt=72946'430011 lcod 72889'430010 mlcod 72889'430010 active+undersized+degraded+remapped+backfilling] recover_replicas: object added to missing set for backfill, but is not in recovering, error!
     0> 2017-08-22 17:27:50.965861 7f4af4057700 -1 *** Caught signal (Aborted) **
     in thread 7f4af4057700 thread_name:tp_osd_tp

This has been going on since the weekend; we were seeing a different error message before upgrading from 11.2.0 to 11.2.1. The pool is running EC 8+3. The OSDs crash with that error only to be restarted by systemd and fail again in exactly the same way. Eventually systemd gives up, the mon_osd_down_out_interval expires, and the PG just stays down+remapped while the others recover and go active+clean.

Can anybody help with this type of problem?

Best regards,

George Vasilakakos
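P.S. In case the details help, this is roughly how I've been inspecting and containing the situation (PG and OSD IDs as above; the pool and profile names are placeholders, substitute your own):

    # Dump the full state of the stuck PG, including the acting set --
    # the 2147483647 slot in the log above is CRUSH_ITEM_NONE, i.e. a
    # shard with no OSD mapped to it
    ceph pg 1.138 query

    # Confirm the pool's erasure-code profile (k=8, m=3 in our case)
    ceph osd pool get <poolname> erasure_code_profile
    ceph osd erasure-code-profile get <profilename>

    # Keep a crash-looping OSD down while debugging, and stop the
    # cluster from marking it out in the meantime
    systemctl stop ceph-osd@1290.service
    ceph osd set noout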