Re: High memory usage kills OSD while peering

Hi, normally this would not be an issue.
But I think the whole episode of the OOM killer taking OSDs out and nodes dying caused the OSDs to make a lot of errors when writing files to disk, so we are seeing hundreds of such files so far, and we are not sure how much is still left to fix. We had to run "ceph osd set pause" to keep recovery moving; otherwise it is a mess. I am willing to write a patch for this if anyone has a good idea of how to deal with it, since I am not sure what the best approach is.

My idea (not sure how easy it is to implement) is: when we hit a size mismatch, grab all the chunks, take enough shards with matching sizes, and decode from those, then probably mark the PG inconsistent and let repair deal with it once the PG finishes recovering. A rough sketch of that is below.
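For illustration only, a minimal standalone sketch of that idea (the names and types are made up for the example, not the real ECUtil or bufferlist interfaces): bucket the available shards by length, keep the largest group that still has at least k members, decode from those, and flag the PG as inconsistent whenever shards had to be dropped, instead of asserting on the first length mismatch the way ECUtil::decode() does now.

// Hypothetical sketch only, not Ceph's real ECUtil::decode();
// bufferlists are stood in for by std::string and the actual EC
// decode call is left out.
#include <map>
#include <optional>
#include <string>
#include <vector>

struct DecodePlan {
  std::vector<int> usable_shards;    // shard ids whose sizes agree
  bool mark_pg_inconsistent = false; // true when shards were dropped
};

// Pick a size-consistent subset of shards instead of asserting on the
// first mismatch.  'k' is the number of shards needed to decode.
std::optional<DecodePlan> plan_decode(
    const std::map<int, std::string>& shard_buffers, unsigned k) {
  // Bucket shard ids by their on-disk length.
  std::map<size_t, std::vector<int>> by_len;
  for (const auto& [shard, buf] : shard_buffers)
    by_len[buf.size()].push_back(shard);

  // Take the length that the most shards agree on (majority vote).
  const std::vector<int>* best = nullptr;
  for (const auto& bucket : by_len)
    if (!best || bucket.second.size() > best->size())
      best = &bucket.second;

  if (!best || best->size() < k)
    return std::nullopt;  // not enough agreeing shards; give up cleanly

  DecodePlan plan;
  plan.usable_shards.assign(best->begin(), best->begin() + k);
  // If anything was dropped, mark the PG inconsistent so a later
  // scrub/repair can rebuild the bad shards after recovery finishes.
  plan.mark_pg_inconsistent = (best->size() != shard_buffers.size());
  return plan;
}

In the real code this check would have to sit where the failed assert is in ECUtil.cc, before the buffers are handed to the erasure code plugin, and the "mark inconsistent" part would need to be plumbed back up to the PG, so treat the above only as a starting point for discussion.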

On 08/29/2017 10:34 PM, Mustafa Muhammad wrote:
I reported this issue; please take a look if you can:

http://tracker.ceph.com/issues/21173

Regards
Mustafa

On Tue, Aug 29, 2017 at 10:44 AM, Mustafa Muhammad
<mustafa1024m@xxxxxxxxx> wrote:
Hi all,
Not sure if I should open a new thread, but this is the same cluster,
so this should provide a little background.
Now the cluster is up and recovering, but we are hitting a bug that is
crashing the OSD

      0> 2017-08-29 10:00:51.699557 7fae66139700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/ECUtil.cc:
In function 'int ECUtil::decode(const ECUtil::stripe_info_t&,
ceph::ErasureCodeInterfaceRef&, std::map<int, ceph::buffer::list>&,
std::map<int, ceph::buffer::list*>&)' thread 7fae66139700 time
2017-08-29 10:00:51.688625
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.4/rpm/el7/BUILD/ceph-12.1.4/src/osd/ECUtil.cc:
59: FAILED assert(i->second.length() == total_data_size)

Probably http://tracker.ceph.com/issues/14009

Some shards are problematic: some are smaller than they should be
(definitely a problem), and in others the last part is all zeros (not
sure whether that is padding or a problem).
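One way to tell the two cases apart, assuming the usual EC layout where each shard holds ceil(object_size / stripe_width) * chunk_size bytes and the final, partially filled stripe is zero-padded (an assumption worth checking against the pool's EC profile), is sketched below; the helper and its parameters are purely illustrative.

// Illustrative only: classify a shard given the logical object size and
// the pool's EC parameters (k data shards, chunk_size = stripe_width / k).
// Assumes the last stripe of the object is zero-padded on each shard.
#include <cstdint>

enum class ShardState { OK, PADDED_TAIL, TRUNCATED };

ShardState classify_shard(uint64_t shard_len, uint64_t object_size,
                          uint64_t chunk_size, unsigned k) {
  // Each stripe carries chunk_size * k bytes of user data; the object is
  // rounded up to whole stripes, so every shard should be exactly
  // 'stripes * chunk_size' bytes long.
  const uint64_t stripe_width = chunk_size * k;
  const uint64_t stripes = (object_size + stripe_width - 1) / stripe_width;
  const uint64_t expected = stripes * chunk_size;

  if (shard_len < expected)
    return ShardState::TRUNCATED;  // genuinely short: real corruption
  // A zero run at the end that still fits inside 'expected' is just the
  // padding of the final, partially filled stripe.
  return (object_size % stripe_width) ? ShardState::PADDED_TAIL
                                      : ShardState::OK;
}

Under that assumption, a zero tail on a shard of the expected length is normal padding, while a shard shorter than the expected length is the case that trips the assert above.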

Now we have set noup, marked the OSDs with corrupt chunks down, and let
the recovery proceed, but this is happening in lots of PGs and is very
slow.
Is there anything we can do to fix this faster? We tried removing the
corrupted chunk and got this crash (I grepped for the thread in which
the abort happened):

    -77> 2017-08-28 15:11:40.030178 7f90cd519700  0 osd.377 pg_epoch:
1102631 pg[143.1b0s0( v 1098703'309813 (960110'306653,1098703'309813]
local-lis/les=1102586/1102609 n=63499 ec=470378/470378 lis/c
1102586/960364 les/c/f 1102609/960364/1061015 1102545/1102586/1102586)
[377,77,248,635,642,111,182,234,531,307,29,648]/[377,77,248,198,529,111,182,234,548,307,29,174]
r=0 lpr=1102586 pi=[960339,1102586)/44 rops=1
bft=531(8),635(3),642(4),648(11) crt=1098703'309813 lcod 0'0 mlcod 0'0
active+remapped+backfilling] failed_push
143:0d9ce204:::default.63296332.1__shadow_2033460653.2~dpBlpEu3nMuFDe6ikBFMso5ivuBb7oj.1_93:head
from shard 548(8), reps on  unfound? 0
     -2> 2017-08-28 15:11:40.130722 7f90cd519700 -1 osd.377 pg_epoch:
1102631 pg[143.1b0s0( v 1098703'309813 (960110'306653,1098703'309813]
local-lis/les=1102586/1102609 n=63499 ec=470378/470378 lis/c
1102586/960364 les/c/f 1102609/960364/1061015 1102545/1102586/1102586)
[377,77,248,635,642,111,182,234,531,307,29,648]/[377,77,248,198,529,111,182,234,548,307,29,174]
r=0 lpr=1102586 pi=[960339,1102586)/44
bft=531(8),635(3),642(4),648(11) crt=1098703'309813 lcod 0'0 mlcod 0'0
active+remapped+backfilling] recover_replicas: object
143:0d9ce204:::default.63296332.1__shadow_2033460653.2~dpBlpEu3nMuFDe6ikBFMso5ivuBb7oj.1_93:head
last_backfill 143:0d9ce1c5:::default.63296332.1__shadow_26882237.2~mGGm_A45xKldAdADFC13qizbUiC0Yrw.1_158:head
     -1> 2017-08-28 15:11:40.130802 7f90cd519700 -1 osd.377 pg_epoch:
1102631 pg[143.1b0s0( v 1098703'309813 (960110'306653,1098703'309813]
local-lis/les=1102586/1102609 n=63499 ec=470378/470378 lis/c
1102586/960364 les/c/f 1102609/960364/1061015 1102545/1102586/1102586)
[377,77,248,635,642,111,182,234,531,307,29,648]/[377,77,248,198,529,111,182,234,548,307,29,174]
r=0 lpr=1102586 pi=[960339,1102586)/44
bft=531(8),635(3),642(4),648(11) crt=1098703'309813 lcod 0'0 mlcod 0'0
active+remapped+backfilling] recover_replicas: object added to missing
set for backfill, but is not in recovering, error!
      0> 2017-08-28 15:11:40.134768 7f90cd519700 -1 *** Caught signal
(Aborted) **
in thread 7f90cd519700 thread_name:tp_osd_tp

What can we do to fix this?
Will enabling fast_read on the pool benefit us, or does it only affect client reads?
Any ideas?

Regards
Mustafa



