Maybe some configuration change occurred that now takes effect when you start the OSD? I'm not sure what could affect memory usage, though - some ulimit values maybe (stack size), the number of OSD threads (compare this OSD's count to the rest of the OSDs), fd cache size. Look in /proc and compare everything. Also look at "ceph osd tree" - didn't someone touch it while you were gone?

Jan

> On 07 Sep 2015, at 13:40, Mariusz Gronczewski <mariusz.gronczewski@xxxxxxxxxxxx> wrote:
>
> On Mon, 7 Sep 2015 13:02:38 +0200, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>
>> Apart from a bug causing this, it could be caused by a failure of other OSDs (even a temporary one) that starts backfills:
>>
>> 1) something fails
>> 2) some PGs move to this OSD
>> 3) this OSD has to allocate memory for all the PGs
>> 4) whatever failed comes back up
>> 5) the memory is never released
>>
>> A similar scenario is possible if, for example, someone confuses "ceph osd crush reweight" with "ceph osd reweight" (yes, this happened to me :-)).
>>
>> Did you try just restarting the OSD before you upgraded it?
>
> Stopped, upgraded, started. It helped a bit (<3GB per OSD), but besides
> that nothing changed. I've tried waiting until it stops eating CPU and then
> restarting it, but it still eats >2GB of memory, which means I can't start
> all 4 OSDs at the same time ;/
>
> I've also added the noin, nobackfill and norecover flags, but that didn't help.
>
> It is surprising to me because before this, all 4 OSDs together ate less than
> 2GB of memory, so I thought I had enough headroom, and we did restart
> machines and remove/add OSDs to test that recovery/rebalance goes fine.
>
> It also doesn't have any external traffic at the moment.
>
>>> On 07 Sep 2015, at 12:58, Mariusz Gronczewski <mariusz.gronczewski@xxxxxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> Over the weekend (I was on vacation, so I didn't catch exactly what happened)
>>> our OSDs started eating in excess of 6GB of RAM (RSS, that is), which was a
>>> problem considering that we had only 8GB of RAM for 4 OSDs (about 700
>>> PGs per OSD and about 70GB of space used). The resulting spam of coredumps
>>> and OOMs dragged the OSDs down to unusability.
>>>
>>> I then upgraded one of the OSDs to Hammer, which made it a bit better (~2GB
>>> per OSD), but usage is still much higher than before.
>>>
>>> Any ideas what the reason for that could be? The logs are mostly full of
>>> OSDs trying to recover and timed-out heartbeats.
>>>
>>> --
>>> Mariusz Gronczewski, Administrator
>>>
>>> Efigence S. A.
>>> ul. Wołoska 9a, 02-583 Warszawa
>>> T: [+48] 22 380 13 13
>>> F: [+48] 22 380 13 14
>>> E: mariusz.gronczewski@xxxxxxxxxxxx
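
For the /proc comparison suggested at the top of the thread, here is a minimal sketch that collects the values worth diffing between the misbehaving OSD and a healthy one (assumes root access and daemons named ceph-osd; adjust for your setup):

    # Per ceph-osd process: relevant ulimits, thread count, RSS and open fd count
    for pid in $(pgrep ceph-osd); do
        echo "=== ceph-osd pid $pid ==="
        grep -E 'Max (stack size|open files|processes)' /proc/$pid/limits
        grep -E '^(Threads|VmRSS)' /proc/$pid/status
        echo "open fds: $(ls /proc/$pid/fd | wc -l)"
    done

Run it on the affected host and on a healthy one and diff the output; a stack-size or thread-count difference would point at the kind of configuration drift described above.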
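
Since the pattern described upthread is memory allocated during recovery that is never given back, it is also cheap to check whether the allocator is simply holding on to freed pages. On builds linked against tcmalloc (the default in most packages), the OSDs expose heap commands for this - a sketch, with osd.0 standing in for the affected OSD:

    # Show tcmalloc heap statistics for one OSD...
    ceph tell osd.0 heap stats
    # ...and ask it to return unused freed pages to the OS
    ceph tell osd.0 heap release

If "heap stats" shows a large "freelist" held by the allocator while RSS stays high, "heap release" can shrink RSS without a restart.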
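
As for the "ceph osd crush reweight" vs "ceph osd reweight" mix-up mentioned above, the difference in a nutshell (the weights below are made-up examples, not recommendations):

    # Changes the CRUSH weight: the long-term placement weight, usually sized to the disk's capacity in TB
    ceph osd crush reweight osd.4 1.82
    # Changes the temporary 0..1 override weight (the "reweight" column in "ceph osd tree")
    ceph osd reweight 4 0.8

Confusing the two moves far more data than intended, which triggers exactly the backfill-driven memory growth described in steps 1-5 above.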