Re: High memory usage kills OSD while peering

On Wed, 23 Aug 2017, Linux Chips wrote:
> On 08/23/2017 04:46 PM, Sage Weil wrote:
> > On Wed, 23 Aug 2017, Linux Chips wrote:
> > > On 08/23/2017 01:33 AM, Sage Weil wrote:
> > > > One other trick that has been used here: if you look inside the PG
> > > > directories on the OSDs and find that they are mostly empty then it's
> > > > possible some of the memory and peering overhead is related to
> > > > empty and useless PG instances on the wrong OSDs.  You can write a
> > > > script
> > > > to find empty directories (or ones that only contain the single pgmeta
> > > > object with a mostly-empty name) and remove them (using
> > > > ceph-objectstore-tool).  (For safety I'd recommend doing
> > > > ceph-objectstore-tool export first, just in case there is some useful
> > > > metadata there.)
> > > > 
> > > > That will only help if most of the pg dirs look empty, though.  If so,
> > > > it's worth a shot!
> > > > 
> > > > The other thing we once did was use a kludge patch to trim the
> > > > past_intervals metadata, which was responsible for most of the memory
> > > > usage.  I can't tell from the profile in this thread if that is the case
> > > > or not.  There is a patch floating around in git somewhere that can be
> > > > reused if it looks like that is the thing consuming the memory.
> > > > 
> > > > sage
> > > > 
> > > > 
> > > 
> > > we'll try the empty pg search. not sure how many there are, but i randomly
> > > checked and found a few.
> > > 
> > > as for the "kludge" patch, where can I find it? I searched the git repo
> > > but could not identify it; I did not know what to look for specifically.
> > > also, what would we need to better judge whether the patch would be useful?
> > > e.g. whether we need another/more memory profiling.
> > 
> > I found and rebased the branch, but until we have some confidence this is
> > the problem I wouldn't use it.
> >   
> > > we installed a test cluster of 4 nodes and replicated the issue there, and
> > > we are testing various scenarios there. if anyone cares to replicate it
> > > i can elaborate on the steps.
> > 
> > How were you able to reproduce the situation?
> 
> we deployed the cluster normally.
> create some profiles:
> 
> ceph osd erasure-code-profile set k3m1 k=3 m=1
> ceph osd erasure-code-profile set k9m3 k=9 m=3
> ceph osd erasure-code-profile set k6m2 k=6 m=2
> ceph osd erasure-code-profile set k12m4 k=12 m=4
> 
> then edit the crush rule set manually for them so they use (step chooseleaf
> indep 0 type osd) instead of host. this is necessary because the cluster has
> only 4 OSD nodes.
> 
> add some pools:
> 
> ceph osd pool create testk9m3 1024 1024 erasure k9m3
> ceph osd pool create testk12m4 1024 1024 erasure k12m4
> ceph osd pool create testk6m2 1024 1024 erasure k6m2
> ceph osd pool create testk3m1 1024 1024 erasure k3m1
> 
> fill them with some data using rados bench; we let it run for about 2-3
> hours, until we had about 2000+ kobjects (more is better).
> 
> rados bench -p testk3m1 72000 write -t 256 -b 4K --no-cleanup
> rados bench -p testk6m2 72000 write -t 256 -b 4K --no-cleanup
> rados bench -p rbd 72000 write -t 256 -b 4K --no-cleanup
> rados bench -p testk12m4 7200 write -t 256 -b 1M --no-cleanup
> 
> 
> then we start messing with the placement of the hosts: change their racks a
> couple of times, set OSDs down randomly, and randomly restart them (the
> idea is to simulate a long-unhealthy cluster with a lot of osdmaps)
> 
> while true ; do i=$(( ( RANDOM % 48 ) + 1 )) ; ./restartOSD \
> $(ceph osd tree | grep up | tail -n $i | head -1 | awk '{print $1}' ) \
> ; done
> 
> restartOSD is a script:
> #!/bin/bash
> osd=$1
> IP=$(./findOSD $osd)
> echo "sshing to $IP";
> ssh $IP "systemctl restart ceph-osd@$osd";
> 
> findOSD is:
> #!/bin/bash
> if [[ $1 =~ ^[0-9]+$ ]]; then
> ceph osd find $1 | grep ip | sed -e 's/.*\": \"\(.*\):.*/\1/'
> #ceph osd tree | grep -e "host\|osd.$1 " | grep osd.$1 -B1
> else
> echo "Usage $0 OSDNUM"
> fi
> 
> we also set a memory limit in the systemd unit file, so the oom killer kills
> them and makes things go faster. we put this
> [Service]
> MemoryLimit=2G
> 
> inside "/etc/systemd/system/ceph-osd@.service.d/memory.conf"
> we start with something like 2GB and increase it whenever we feel the limit
> is too harsh. by the time we reach a 10GB limit, things are pretty ugly though
> (which, oddly, is good).
> 
> after a while the mon store will grow bigger and bigger, and the amount of
> ram consumed will grow too.
> the target is for the OSD status
> ceph daemon osd.xx status
> to show a difference between the oldest and newest map of about 20000-40000
> epochs.
> 
> at this point, we stop the "restart script" and the "rados bench". if we
> restart all the OSDs, they will consume all the RAM in the node, and either
> the oom killer will be fast enough to kill them, or the whole node will die.
> so we usually put the memory limit in the unit file at about 20-30 GB at this
> point so we do not lose the node.

Okay, so I think the combination of (1) removing empty PGs and (2) pruning 
past_intervals will help.  (1) can be scripted by looking in 
current/$pg_HEAD directories and picking out the ones with 0 or 1 objects 
in them, doing ceph-objecstore-tool export to make a backup (just in 
case), and then removing them (with ceph-objectstore-tool).  Be careful of 
PGs for empty pools since those will be naturally empty (and you want 
to keep them).
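
Something along these lines should do it (a rough, untested sketch; the data
path, backup directory, and the "0 or 1 objects" threshold are my assumptions,
and the OSD has to be stopped before running ceph-objectstore-tool against it):

#!/bin/bash
# untested sketch: on a stopped OSD, find PG dirs containing at most the
# pgmeta object, export each one as a backup, then remove it
osd=12                                # hypothetical OSD id
data=/var/lib/ceph/osd/ceph-$osd      # assumed filestore data path
journal=$data/journal                 # assumed journal path
backup=/root/pg-exports
mkdir -p "$backup"
for dir in "$data"/current/*_head; do
    n=$(find "$dir" -type f | wc -l)
    if [ "$n" -le 1 ]; then
        pgid=$(basename "$dir" | sed 's/_head$//')
        echo "exporting and removing $pgid ($n objects)"
        ceph-objectstore-tool --data-path "$data" --journal-path "$journal" \
            --pgid "$pgid" --op export --file "$backup/$pgid.export"
        ceph-objectstore-tool --data-path "$data" --journal-path "$journal" \
            --pgid "$pgid" --op remove
    fi
done

(Newer releases want --force on the remove.)  Skip or whitelist any PGs that
belong to pools that are legitimately empty.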

For (2), see the wip-prune-past-intervals-jewel branch in ceph-ci.git.  If 
that is applied to the kraken branch it ought to work (although it's 
untested).  Alternatively, you can just upgrade to luminous, as it 
implements a more sophisticated version of the same thing.  You need to 
upgrade mons, mark all osds down, upgrade osds and start at least one of 
them, and then set 'ceph osd require-osd-release luminous' before it'll 
switch to the new past intervals representation.  Definitely test it on 
your test cluster to ensure it reduces the memory usage!
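
Roughly, the ordering is something like this (just a sketch; the package
upgrades themselves are omitted):

# after the mons are upgraded to luminous and restarted:
ceph osd down $(ceph osd ls)            # mark every osd down
# upgrade the osd packages, then start at least one osd, e.g.:
systemctl start ceph-osd@0
ceph osd require-osd-release luminous   # enables the new representation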

If that doesn't sort things out we'll need to see a heap profile for an 
OOMing OSD to make sure we know what is using all of the RAM...
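
For reference, something like

ceph tell osd.NNN heap start_profiler
ceph tell osd.NNN heap dump
ceph tell osd.NNN heap stats

(with osd.NNN a surviving OSD, and assuming tcmalloc is in use) will drop a
.heap file in the osd's log directory that google-pprof can read against the
ceph-osd binary.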

sage

 > 
> > 
> > > if all fails, do you think moving pgs out of the current dir is safe? we
> > > are trying to test it, but we'll never be sure 100%
> > 
> > It is safe if you use ceph-objectstore-tool export and then remove.  Do
> > not just move the directory around as that will leave behind all kinds of
> > random state in leveldb!
> > 
> > sage
> > 
> 