Re: Huge memory usage spike in OSD on hammer/giant


 



Yes, there is a bug that can cause huge memory usage. It is triggered when an OSD goes down, or is added into the cluster, and recovery/backfilling starts.

The patches https://github.com/ceph/ceph/pull/5656 and https://github.com/ceph/ceph/pull/5451, merged into master, fix it, and they will be backported.

I think Ceph v0.93 or any newer version may hit this bug.
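A quick way to see whether a node falls in that range is to parse the OSD's reported version. This is only a sketch; it assumes a plain "major.minor[.patch]" string like the one "ceph-osd --version" prints:

```shell
# Sketch: decide whether a version string falls in the range described
# above (v0.93 or newer). Assumes a plain "major.minor[.patch]" string.
affected_version() {
  ver=$1
  major=${ver%%.*}      # e.g. 0 from "0.94.3"
  rest=${ver#*.}
  minor=${rest%%.*}     # e.g. 94 from "0.94.3"
  [ "$major" -gt 0 ] || [ "$minor" -ge 93 ]
}
# On a live node (the exact output format of ceph-osd --version is an
# assumption; adjust the awk field if yours differs):
#   v=$(ceph-osd --version | awk '{print $3}')
#   affected_version "$v" && echo "may hit the recovery memory bug"
```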

2015-09-07 20:42 GMT+08:00 Shinobu Kinjo <skinjo@xxxxxxxxxx>:
How heavy was the network traffic?

Have you tried capturing the traffic between the cluster and public networks
to see where such a bunch of traffic came from?

 Shinobu
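One way to do that capture, sketched below; the interface name and the 6800-7300 port range (Ceph's default range for OSD daemons) are assumptions about this particular cluster:

```shell
# Build the tcpdump invocation for capturing inter-OSD traffic; printed
# rather than executed here so it can be reviewed before running as root.
capture_cmd() {
  iface=${1:-eth1}   # eth1 is a placeholder for the cluster-network interface
  echo "tcpdump -i $iface -w /tmp/ceph-osd.pcap portrange 6800-7300"
}
# Run the printed command on both the cluster and the public interface,
# then compare the top talkers in the two pcaps:
#   $(capture_cmd eth1)
```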

----- Original Message -----
From: "Jan Schermer" <jan@xxxxxxxxxxx>
To: "Mariusz Gronczewski" <mariusz.gronczewski@xxxxxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Sent: Monday, September 7, 2015 9:17:04 PM
Subject: Re: Huge memory usage spike in OSD on hammer/giant

Hmm, even network traffic went up.
Nothing in the logs on the mons from when it started, 9/4 ~6 AM?

Jan

> On 07 Sep 2015, at 14:11, Mariusz Gronczewski <mariusz.gronczewski@xxxxxxxxxxxx> wrote:
>
> On Mon, 7 Sep 2015 13:44:55 +0200, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>
>> Maybe some configuration change occurred that now takes effect when you start the OSD?
>> Not sure what could affect memory usage, though - some ulimit values maybe (stack size), the number of OSD threads (compare the number for this OSD to the rest of the OSDs), fd cache size. Look in /proc and compare everything.
>> Also look at "ceph osd tree" - didn't someone touch it while you were gone?
>>
>> Jan
>>
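The /proc comparison suggested above can be scripted; a minimal sketch, assuming a standard Linux /proc layout:

```shell
# Dump thread count, open fds and the stack soft limit for one process,
# so a misconfigured OSD stands out when compared against its peers.
proc_summary() {
  pid=$1
  threads=$(awk '/^Threads:/ {print $2}' "/proc/$pid/status" 2>/dev/null)
  fds=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
  stack=$(awk '/^Max stack size/ {print $4}' "/proc/$pid/limits" 2>/dev/null)
  echo "pid=$pid threads=$threads fds=$fds stack=$stack"
}
# Compare all OSDs on the host:
#   for p in $(pgrep ceph-osd); do proc_summary "$p"; done
```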
>
>> number of OSD threads (compare the number from this OSD to the rest of
> OSDs),
>
> it occurred on all OSDs, and it looked like this:
> http://imgur.com/IIMIyRG
>
> sadly I was on vacation so I didn't manage to catch it in the act ;/ but I'm
> sure there was no config change
>
>
>>> On 07 Sep 2015, at 13:40, Mariusz Gronczewski <mariusz.gronczewski@xxxxxxxxxxxx> wrote:
>>>
>>> On Mon, 7 Sep 2015 13:02:38 +0200, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>>>
>>>> Apart from a bug causing this, it could be caused by a failure of other OSDs (even a temporary one) that starts backfills:
>>>>
>>>> 1) something fails
>>>> 2) some PGs move to this OSD
>>>> 3) this OSD has to allocate memory for all the PGs
>>>> 4) whatever fails gets back up
>>>> 5) the memory is never released.
>>>>
>>>> A similar scenario is possible if, for example, someone confuses "ceph osd crush reweight" with "ceph osd reweight" (yes, this has happened to me :-)).
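For anyone hitting the same confusion: "ceph osd crush reweight" changes the CRUSH weight (roughly the device's capacity), while "ceph osd reweight" sets a temporary 0.0-1.0 override. A sketch of a guard for the override form (the helper name is made up; the ceph commands in the comments are the real ones):

```shell
# Cheat-sheet for the two easily-confused commands:
#   ceph osd crush reweight osd.<id> <weight>   # CRUSH weight (~capacity, e.g. TB)
#   ceph osd reweight <id> <0.0-1.0>            # temporary override weight
# A tiny guard that refuses an override weight outside [0,1]; it only
# prints the command it would run, rather than touching a cluster.
safe_reweight() {
  id=$1; w=$2
  awk -v w="$w" 'BEGIN { exit !(w >= 0 && w <= 1) }' || {
    echo "override weight must be in [0,1]" >&2; return 1
  }
  echo "would run: ceph osd reweight $id $w"
}
```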
>>>>
>>>> Did you try just restarting the OSD before you upgraded it?
>>>
>>> stopped, upgraded, started. It helped a bit (<3GB per OSD) but besides
>>> that nothing changed. I've tried waiting until it stops eating CPU and then
>>> restarting it, but it still eats >2GB of memory, which means I can't start
>>> all 4 OSDs at the same time ;/
>>>
>>> I've also added the noin, nobackfill and norecover flags but that didn't help
>>>
>>> it is surprising to me because before this, all 4 OSDs together ate less than
>>> 2GB of memory, so I thought I had enough headroom, and we did restart
>>> machines and removed/added OSDs to test that recovery/rebalance went fine
>>>
>>> it also does not have any external traffic at the moment
>>>
>>>
>>>>> On 07 Sep 2015, at 12:58, Mariusz Gronczewski <mariusz.gronczewski@xxxxxxxxxxxx> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> over the weekend (I was on vacation, so I don't know exactly what happened)
>>>>> our OSDs started eating in excess of 6GB of RAM (well, RSS), which was a
>>>>> problem considering that we have only 8GB of RAM for 4 OSDs (about 700
>>>>> PGs per OSD and about 70GB of space used). So a storm of coredumps and OOMs
>>>>> ground the OSDs down to unusability.
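Rough arithmetic on those numbers: 6GB of RSS over ~700 PGs is almost 9MB per PG, far more than these OSDs used before the spike, which points at accumulation rather than normal per-PG state:

```shell
# 6 GB RSS spread over ~700 PGs, in MB per PG
awk 'BEGIN { printf "%.1f MB/PG\n", 6 * 1024 / 700 }'   # prints "8.8 MB/PG"
```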
>>>>>
>>>>> I then upgraded one of the OSDs to hammer, which made it a bit better (~2GB
>>>>> per OSD) but still much higher usage than before.
>>>>>
>>>>> any ideas what could be the reason for that? The logs are mostly full of
>>>>> OSDs trying to recover and timed-out heartbeats
>>>>>
>>>>> --
>>>>> Mariusz Gronczewski, Administrator
>>>>>
>>>>> Efigence S. A.
>>>>> ul. Wołoska 9a, 02-583 Warszawa
>>>>> T: [+48] 22 380 13 13
>>>>> F: [+48] 22 380 13 14
>>>>> E: mariusz.gronczewski@xxxxxxxxxxxx
>>>>> <mailto:mariusz.gronczewski@xxxxxxxxxxxx>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>
>>>
>>>
>>> --
>>> Mariusz Gronczewski, Administrator
>>>
>>> Efigence S. A.
>>> ul. Wołoska 9a, 02-583 Warszawa
>>> T: [+48] 22 380 13 13
>>> F: [+48] 22 380 13 14
>>> E: mariusz.gronczewski@xxxxxxxxxxxx
>>> <mailto:mariusz.gronczewski@xxxxxxxxxxxx>
>>
>
>
>
> --
> Mariusz Gronczewski, Administrator
>
> Efigence S. A.
> ul. Wołoska 9a, 02-583 Warszawa
> T: [+48] 22 380 13 13
> F: [+48] 22 380 13 14
> E: mariusz.gronczewski@xxxxxxxxxxxx
> <mailto:mariusz.gronczewski@xxxxxxxxxxxx>




--
Regards,
xinze
