Re: Huge memory usage spike in OSD on hammer/giant


 



I first had that problem on Giant; I only upgraded to 0.94.3 in the hope that
it would eat less RAM

On Mon, 7 Sep 2015 20:51:57 +0800, 池信泽 <xmdxcxz@xxxxxxxxx> wrote:

> Yeah, there is a bug that can use a huge amount of memory. It is triggered
> when an OSD goes down or is added to the cluster and recovery/backfilling starts.
> 
> The patches https://github.com/ceph/ceph/pull/5656 and
> https://github.com/ceph/ceph/pull/5451 merged into master fix it, and they
> will be backported.
> 
> I think Ceph v0.93 or newer may hit this bug.
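> 
> (For reference, a quick way to check which build each OSD daemon is actually
> running -- e.g. to tell whether a release containing the fix is in place --
> is something like the following; osd.0 is just an example id:
> 
>     ceph tell osd.0 version
>     # or, on the OSD host itself, through the admin socket:
>     ceph daemon osd.0 version
> )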
> 
> 2015-09-07 20:42 GMT+08:00 Shinobu Kinjo <skinjo@xxxxxxxxxx>:
> 
> > How heavy was the network traffic?
> >
> > Have you tried capturing the traffic on the cluster and public networks
> > to see where it all came from?
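> >
> > (Assuming the cluster network is on its own interface, a rough way to see
> > the top talkers would be something like the following -- eth1 is just a
> > placeholder for the cluster-facing interface:
> >
> >     tcpdump -i eth1 -nn -c 2000 2>/dev/null | awk '{print $3}' | sort | uniq -c | sort -rn | head
> > )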
> >
> >  Shinobu
> >
> > ----- Original Message -----
> > From: "Jan Schermer" <jan@xxxxxxxxxxx>
> > To: "Mariusz Gronczewski" <mariusz.gronczewski@xxxxxxxxxxxx>
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Sent: Monday, September 7, 2015 9:17:04 PM
> > Subject: Re:  Huge memory usage spike in OSD on hammer/giant
> >
> > Hmm, even the network traffic went up.
> > Nothing in the mon logs around when it started, 9/4 ~6 AM?
> >
> > Jan
> >
> > > On 07 Sep 2015, at 14:11, Mariusz Gronczewski <
> > mariusz.gronczewski@xxxxxxxxxxxx> wrote:
> > >
> > > On Mon, 7 Sep 2015 13:44:55 +0200, Jan Schermer <jan@xxxxxxxxxxx> wrote:
> > >
> > >> Maybe some configuration change occurred that now takes effect when
> > >> you start the OSD?
> > >> Not sure what could affect memory usage though - some ulimit values
> > >> maybe (stack size), number of OSD threads (compare the number from
> > >> this OSD to the rest of the OSDs), fd cache size. Look in /proc and
> > >> compare everything.
> > >> Also look in "ceph osd tree" - didn't someone touch it while you were
> > >> gone?
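> > >>
> > >> (A rough way to compare those across the OSD processes, assuming pgrep
> > >> can find them, would be something like:
> > >>
> > >>     for pid in $(pgrep -f ceph-osd); do
> > >>       echo "== pid $pid =="
> > >>       grep -E 'Threads|VmRSS' /proc/$pid/status       # thread count and resident memory
> > >>       grep -E 'Max open files|Max stack size' /proc/$pid/limits
> > >>     done
> > >> )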
> > >>
> > >> Jan
> > >>
> > >
> > >> number of OSD threads (compare the number from this OSD to the rest
> > >> of the OSDs),
> > >
> > > it occurred on all OSDs, and it looked like this:
> > > http://imgur.com/IIMIyRG
> > >
> > > sadly I was on vacation so I didn't manage to catch it earlier ;/ but I'm
> > > sure there was no config change
> > >
> > >
> > >>> On 07 Sep 2015, at 13:40, Mariusz Gronczewski <
> > mariusz.gronczewski@xxxxxxxxxxxx> wrote:
> > >>>
> > >>> On Mon, 7 Sep 2015 13:02:38 +0200, Jan Schermer <jan@xxxxxxxxxxx>
> > wrote:
> > >>>
> > >>>> Apart from a bug causing this, it could be caused by a failure of
> > >>>> other OSDs (even a temporary one) that starts backfills.
> > >>>>
> > >>>> 1) something fails
> > >>>> 2) some PGs move to this OSD
> > >>>> 3) this OSD has to allocate memory for all the PGs
> > >>>> 4) whatever fails gets back up
> > >>>> 5) the memory is never released.
> > >>>>
> > >>>> A similar scenario is possible if, for example, someone confuses "ceph
> > >>>> osd crush reweight" with "ceph osd reweight" (yes, this happened to me :-)).
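> > >>>>
> > >>>> (For anyone reading along, the two commands look alike but do very
> > >>>> different things; roughly, with osd.3 as an example:
> > >>>>
> > >>>>     ceph osd crush reweight osd.3 1.0   # CRUSH weight: how much data the OSD is supposed to hold
> > >>>>     ceph osd reweight 3 0.8             # override weight between 0 and 1, applied on top of the CRUSH weight
> > >>>> )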
> > >>>>
> > >>>> Did you try just restarting the OSD before you upgraded it?
> > >>>
> > >>> stopped, upgraded, started. It helped a bit (<3GB per OSD) but besides
> > >>> that nothing changed. I've tried waiting till it stops eating CPU and then
> > >>> restarting it, but it still eats >2GB of memory, which means I can't start
> > >>> all 4 OSDs at the same time ;/
> > >>>
> > >>> I've also added the noin, nobackfill and norecover flags but that didn't help
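> > >>>
> > >>> (set the usual way, cluster-wide, with
> > >>>
> > >>>     ceph osd set noin
> > >>>     ceph osd set nobackfill
> > >>>     ceph osd set norecover
> > >>>
> > >>> and cleared again later with "ceph osd unset <flag>")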
> > >>>
> > >>> it is surprising to me because before this, all 4 OSDs together ate less
> > >>> than 2GB of memory, so I thought I had enough headroom, and we did restart
> > >>> machines and removed/added OSDs to test that recovery/rebalance goes fine
> > >>>
> > >>> it also does not have any external traffic at the moment
> > >>>
> > >>>
> > >>>>> On 07 Sep 2015, at 12:58, Mariusz Gronczewski <
> > mariusz.gronczewski@xxxxxxxxxxxx> wrote:
> > >>>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> over a weekend (I was on vacation so I didn't catch exactly what happened)
> > >>>>> our OSDs started eating in excess of 6GB of RAM (well, RSS), which was a
> > >>>>> problem considering that we had only 8GB of RAM for 4 OSDs (about 700
> > >>>>> PGs per OSD and about 70GB of space used). So a spam of coredumps and
> > >>>>> OOMs knocked the OSDs down to unusability.
> > >>>>>
> > >>>>> I then upgraded one of the OSDs to Hammer, which made it a bit better
> > >>>>> (~2GB per OSD) but still much higher usage than before.
> > >>>>>
> > >>>>> Any ideas what could be the reason for that? The logs are mostly full of
> > >>>>> OSDs trying to recover and timed-out heartbeats
> > >>>>>
> > >>>>> --
> > >>>>> Mariusz Gronczewski, Administrator
> > >>>>>
> > >>>>> Efigence S. A.
> > >>>>> ul. Wołoska 9a, 02-583 Warszawa
> > >>>>> T: [+48] 22 380 13 13
> > >>>>> F: [+48] 22 380 13 14
> > >>>>> E: mariusz.gronczewski@xxxxxxxxxxxx
> > >>>>> <mailto:mariusz.gronczewski@xxxxxxxxxxxx>
> > >>>>> _______________________________________________
> > >>>>> ceph-users mailing list
> > >>>>> ceph-users@xxxxxxxxxxxxxx
> > >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >>>>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Mariusz Gronczewski, Administrator
> > >>>
> > >>> Efigence S. A.
> > >>> ul. Wołoska 9a, 02-583 Warszawa
> > >>> T: [+48] 22 380 13 13
> > >>> F: [+48] 22 380 13 14
> > >>> E: mariusz.gronczewski@xxxxxxxxxxxx
> > >>> <mailto:mariusz.gronczewski@xxxxxxxxxxxx>
> > >>
> > >
> > >
> > >
> > > --
> > > Mariusz Gronczewski, Administrator
> > >
> > > Efigence S. A.
> > > ul. Wołoska 9a, 02-583 Warszawa
> > > T: [+48] 22 380 13 13
> > > F: [+48] 22 380 13 14
> > > E: mariusz.gronczewski@xxxxxxxxxxxx
> > > <mailto:mariusz.gronczewski@xxxxxxxxxxxx>
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> 
> 
> 



-- 
Mariusz Gronczewski, Administrator

Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczewski@xxxxxxxxxxxx
<mailto:mariusz.gronczewski@xxxxxxxxxxxx>


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
