I reported this issue, if you can take a look:
http://tracker.ceph.com/issues/21173

Regards
Mustafa

On Tue, Aug 29, 2017 at 10:44 AM, Mustafa Muhammad <mustafa1024m@xxxxxxxxx> wrote:
> Hi all,
> Not sure if I should open a new thread, but this is the same cluster,
> so it should provide a little background.
> The cluster is now up and recovering, but we are hitting a bug that is
> crashing the OSDs:
>
>      0> 2017-08-29 10:00:51.699557 7fae66139700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/ECUtil.cc:
> In function 'int ECUtil::decode(const ECUtil::stripe_info_t&,
> ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&,
> std::map<int, ceph::buffer::list*>&)' thread 7fae66139700 time
> 2017-08-29 10:00:51.688625
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/ECUtil.cc:
> 59: FAILED assert(i->second.length() == total_data_size)
>
> This is probably http://tracker.ceph.com/issues/14009
>
> Some shards are problematic: some are smaller than expected
> (definitely a problem), and in others the last part is all zeros
> (not sure whether that is padding or corruption).
>
> We have now set noup, marked the OSDs with corrupt chunks down, and
> let recovery proceed, but this is happening in lots of PGs and is
> very slow.
> Is there anything we can do to fix this faster? We tried removing the
> corrupted chunk
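For anyone following along, the noup mitigation described above corresponds roughly to the commands below. This is a sketch, not the exact commands used here; the OSD id 377 is taken from the log as an example, and pool/PG specifics will differ per cluster.

```shell
# Prevent marked-down OSDs from being automatically marked up again
# while the bad chunks are being worked around.
ceph osd set noup

# Mark an OSD holding a corrupt EC chunk down (377 is an example id).
ceph osd down 377

# Watch recovery progress.
ceph -s
ceph pg dump_stuck

# When recovery has moved past the bad shards, clear the flag.
ceph osd unset noup
```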
> and got this crash (I grepped for the thread in which the abort
> happened):
>
>    -77> 2017-08-28 15:11:40.030178 7f90cd519700  0 osd.377 pg_epoch:
> 1102631 pg[143.1b0s0( v 1098703'309813 (960110'306653,1098703'309813]
> local-lis/les=1102586/1102609 n=63499 ec=470378/470378 lis/c
> 1102586/960364 les/c/f 1102609/960364/1061015 1102545/1102586/1102586)
> [377,77,248,635,642,111,182,234,531,307,29,648]/[377,77,248,198,529,111,182,234,548,307,29,174]
> r=0 lpr=1102586 pi=[960339,1102586)/44 rops=1
> bft=531(8),635(3),642(4),648(11) crt=1098703'309813 lcod 0'0 mlcod 0'0
> active+remapped+backfilling] failed_push
> 143:0d9ce204:::default.63296332.1__shadow_2033460653.2~dpBlpEu3nMuFDe6ikBFMso5ivuBb7oj.1_93:head
> from shard 548(8), reps on unfound? 0
>     -2> 2017-08-28 15:11:40.130722 7f90cd519700 -1 osd.377 pg_epoch:
> 1102631 pg[143.1b0s0( v 1098703'309813 (960110'306653,1098703'309813]
> local-lis/les=1102586/1102609 n=63499 ec=470378/470378 lis/c
> 1102586/960364 les/c/f 1102609/960364/1061015 1102545/1102586/1102586)
> [377,77,248,635,642,111,182,234,531,307,29,648]/[377,77,248,198,529,111,182,234,548,307,29,174]
> r=0 lpr=1102586 pi=[960339,1102586)/44
> bft=531(8),635(3),642(4),648(11) crt=1098703'309813 lcod 0'0 mlcod 0'0
> active+remapped+backfilling] recover_replicas: object
> 143:0d9ce204:::default.63296332.1__shadow_2033460653.2~dpBlpEu3nMuFDe6ikBFMso5ivuBb7oj.1_93:head
> last_backfill 143:0d9ce1c5:::default.63296332.1__shadow_26882237.2~mGGm_A45xKldAdADFC13qizbUiC0Yrw.1_158:head
>     -1> 2017-08-28 15:11:40.130802 7f90cd519700 -1 osd.377 pg_epoch:
> 1102631 pg[143.1b0s0( v 1098703'309813 (960110'306653,1098703'309813]
> local-lis/les=1102586/1102609 n=63499 ec=470378/470378 lis/c
> 1102586/960364 les/c/f 1102609/960364/1061015 1102545/1102586/1102586)
> [377,77,248,635,642,111,182,234,531,307,29,648]/[377,77,248,198,529,111,182,234,548,307,29,174]
> r=0 lpr=1102586 pi=[960339,1102586)/44
> bft=531(8),635(3),642(4),648(11) crt=1098703'309813 lcod 0'0 mlcod 0'0
> active+remapped+backfilling] recover_replicas: object added to missing
> set for backfill, but is not in recovering, error!
>      0> 2017-08-28 15:11:40.134768 7f90cd519700 -1 *** Caught signal
> (Aborted) **
>  in thread 7f90cd519700 thread_name:tp_osd_tp
>
> What can we do to fix this?
> Will enabling fast_read on the pool benefit us, or is it client-only?
> Any ideas?
>
> Regards
> Mustafa
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
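For context on the fast_read question above: fast_read is a per-pool option for erasure-coded pools that affects client reads (the primary issues reads to all shards and reconstructs the object from the first k replies, rather than waiting for the exact data shards). It is toggled per pool; `ecpool` below is a placeholder name, not a pool from this cluster.

```shell
# Enable fast_read on an EC pool (pool name is a placeholder).
ceph osd pool set ecpool fast_read 1

# Verify the current setting.
ceph osd pool get ecpool fast_read
```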