Hi all,

Not sure if I should open a new thread, but this is the same cluster, so the previous thread provides a little background.

The cluster is now up and recovering, but we are hitting a bug that crashes the OSDs:

0> 2017-08-29 10:00:51.699557 7fae66139700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/ECUtil.cc: In function 'int ECUtil::decode(const ECUtil::stripe_info_t&, ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&, std::map<int, ceph::buffer::list*>&)' thread 7fae66139700 time 2017-08-29 10:00:51.688625
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/ECUtil.cc: 59: FAILED assert(i->second.length() == total_data_size)

This is probably http://tracker.ceph.com/issues/14009.

Some shards are problematic: they are either smaller than expected (definitely a problem) or their last part is all zeros (not sure whether that is padding or corruption).

We have now set noup, marked the OSDs with corrupt chunks down, and let recovery proceed, but this is happening in a lot of PGs and it is very slow. Is there anything we can do to fix this faster? We tried removing the corrupted chunk (roughly as sketched below) and got this crash (I grepped for the thread in which the abort happened):

-77> 2017-08-28 15:11:40.030178 7f90cd519700 0 osd.377 pg_epoch: 1102631 pg[143.1b0s0( v 1098703'309813 (960110'306653,1098703'309813] local-lis/les=1102586/1102609 n=63499 ec=470378/470378 lis/c 1102586/960364 les/c/f 1102609/960364/1061015 1102545/1102586/1102586) [377,77,248,635,642,111,182,234,531,307,29,648]/[377,77,248,198,529,111,182,234,548,307,29,174] r=0 lpr=1102586 pi=[960339,1102586)/44 rops=1 bft=531(8),635(3),642(4),648(11) crt=1098703'309813 lcod 0'0 mlcod 0'0 active+remapped+backfilling] failed_push 143:0d9ce204:::default.63296332.1__shadow_2033460653.2~dpBlpEu3nMuFDe6ikBFMso5ivuBb7oj.1_93:head from shard 548(8), reps on unfound? 0
-2> 2017-08-28 15:11:40.130722 7f90cd519700 -1 osd.377 pg_epoch: 1102631 pg[143.1b0s0( v 1098703'309813 (960110'306653,1098703'309813] local-lis/les=1102586/1102609 n=63499 ec=470378/470378 lis/c 1102586/960364 les/c/f 1102609/960364/1061015 1102545/1102586/1102586) [377,77,248,635,642,111,182,234,531,307,29,648]/[377,77,248,198,529,111,182,234,548,307,29,174] r=0 lpr=1102586 pi=[960339,1102586)/44 bft=531(8),635(3),642(4),648(11) crt=1098703'309813 lcod 0'0 mlcod 0'0 active+remapped+backfilling] recover_replicas: object 143:0d9ce204:::default.63296332.1__shadow_2033460653.2~dpBlpEu3nMuFDe6ikBFMso5ivuBb7oj.1_93:head last_backfill 143:0d9ce1c5:::default.63296332.1__shadow_26882237.2~mGGm_A45xKldAdADFC13qizbUiC0Yrw.1_158:head
-1> 2017-08-28 15:11:40.130802 7f90cd519700 -1 osd.377 pg_epoch: 1102631 pg[143.1b0s0( v 1098703'309813 (960110'306653,1098703'309813] local-lis/les=1102586/1102609 n=63499 ec=470378/470378 lis/c 1102586/960364 les/c/f 1102609/960364/1061015 1102545/1102586/1102586) [377,77,248,635,642,111,182,234,531,307,29,648]/[377,77,248,198,529,111,182,234,548,307,29,174] r=0 lpr=1102586 pi=[960339,1102586)/44 bft=531(8),635(3),642(4),648(11) crt=1098703'309813 lcod 0'0 mlcod 0'0 active+remapped+backfilling] recover_replicas: object added to missing set for backfill, but is not in recovering, error!
0> 2017-08-28 15:11:40.134768 7f90cd519700 -1 *** Caught signal (Aborted) ** in thread 7f90cd519700 thread_name:tp_osd_tp

What can we do to fix this?
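For reference, the shard checks and the chunk removal were done with ceph-objectstore-tool, roughly as below. The OSD path, pgid+shard and object name are placeholders/examples and I am reproducing the syntax from memory, so please treat this as a sketch rather than the exact invocation we used:

# Run on the node holding the suspect shard, with that OSD stopped
# (ceph-objectstore-tool needs exclusive access to the store; add
# --journal-path for filestore OSDs if required).
OSD_PATH=/var/lib/ceph/osd/ceph-<id>
PGID=143.1b0s8        # pg plus the shard id stored on that OSD

# find the object's JSON spec in that PG shard
ceph-objectstore-tool --data-path $OSD_PATH --pgid $PGID --op list \
    | grep shadow_2033460653 > obj.json

# dump the shard to check its size and whether the tail is all zeros
ceph-objectstore-tool --data-path $OSD_PATH --pgid $PGID "$(cat obj.json)" get-bytes /tmp/shard.bin
ls -l /tmp/shard.bin
od -A d -x /tmp/shard.bin | tail

# remove the copy we believe is corrupt -- this is the step after which the
# failed_push / recover_replicas abort above appeared
ceph-objectstore-tool --data-path $OSD_PATH --pgid $PGID "$(cat obj.json)" remove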
Will enabling fast_read on the pool benefit us here, or does it only affect client reads?

Any ideas?

Regards,
Mustafa