Re: Huge memory usage spike in OSD on hammer/giant


 



Yes, there is a bug that can cause huge memory usage. It is triggered when an OSD goes down, or is added into the cluster, and recovery/backfilling starts.

The patches https://github.com/ceph/ceph/pull/5656 and https://github.com/ceph/ceph/pull/5451, merged into master, fix it, and they will be backported.

I think Ceph v0.93 or any newer version may hit this bug.
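A quick way to see whether a node falls in that range is to parse the OSD's reported version. This is only a sketch; it assumes a plain "major.minor[.patch]" string like the one "ceph-osd --version" prints:

```shell
# Sketch: decide whether a version string falls in the range described
# above (v0.93 or newer). Assumes a plain "major.minor[.patch]" string.
affected_version() {
  ver=$1
  major=${ver%%.*}      # e.g. 0 from "0.94.3"
  rest=${ver#*.}
  minor=${rest%%.*}     # e.g. 94 from "0.94.3"
  [ "$major" -gt 0 ] || [ "$minor" -ge 93 ]
}
# On a live node (the exact output format of ceph-osd --version is an
# assumption; adjust the awk field if yours differs):
#   v=$(ceph-osd --version | awk '{print $3}')
#   affected_version "$v" && echo "may hit the recovery memory bug"
```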

2015-09-07 20:42 GMT+08:00 Shinobu Kinjo <skinjo@xxxxxxxxxx>:
How heavy was the network traffic?

Have you tried capturing the traffic between the cluster and public networks
to see where such a bunch of traffic came from?

 Shinobu
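One way to do that capture, sketched below; the interface name and the 6800-7300 port range (Ceph's default range for OSD daemons) are assumptions about this particular cluster:

```shell
# Build the tcpdump invocation for capturing inter-OSD traffic; printed
# rather than executed here so it can be reviewed before running as root.
capture_cmd() {
  iface=${1:-eth1}   # eth1 is a placeholder for the cluster-network interface
  echo "tcpdump -i $iface -w /tmp/ceph-osd.pcap portrange 6800-7300"
}
# Run the printed command on both the cluster and the public interface,
# then compare the top talkers in the two pcaps:
#   $(capture_cmd eth1)
```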

----- Original Message -----
From: "Jan Schermer" <jan@xxxxxxxxxxx>
To: "Mariusz Gronczewski" <mariusz.gronczewski@xxxxxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Sent: Monday, September 7, 2015 9:17:04 PM
Subject: Re: Huge memory usage spike in OSD on hammer/giant

Hmm, even network traffic went up.
Nothing in the logs on the mons from when it started, 9/4 ~6 AM?

Jan

> On 07 Sep 2015, at 14:11, Mariusz Gronczewski <mariusz.gronczewski@xxxxxxxxxxxx> wrote:
>
> On Mon, 7 Sep 2015 13:44:55 +0200, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>
>> Maybe some configuration change occurred that now takes effect when you start the OSD?
>> Not sure what could affect memory usage, though - some ulimit values maybe (stack size), the number of OSD threads (compare the number for this OSD to the rest of the OSDs), fd cache size. Look in /proc and compare everything.
>> Also look at "ceph osd tree" - didn't someone touch it while you were gone?
>>
>> Jan
>>
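The /proc comparison suggested above can be scripted; a minimal sketch, assuming a standard Linux /proc layout:

```shell
# Dump thread count, open fds and the stack soft limit for one process,
# so a misconfigured OSD stands out when compared against its peers.
proc_summary() {
  pid=$1
  threads=$(awk '/^Threads:/ {print $2}' "/proc/$pid/status" 2>/dev/null)
  fds=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
  stack=$(awk '/^Max stack size/ {print $4}' "/proc/$pid/limits" 2>/dev/null)
  echo "pid=$pid threads=$threads fds=$fds stack=$stack"
}
# Compare all OSDs on the host:
#   for p in $(pgrep ceph-osd); do proc_summary "$p"; done
```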
>
>> number of OSD threads (compare the number from this OSD to the rest of
> OSDs),
>
> it occurred on all OSDs, and it looked like this:
> http://imgur.com/IIMIyRG
>
> sadly I was on vacation so I didn't manage to catch it in the act ;/ but I'm
> sure there was no config change
>
>
>>> On 07 Sep 2015, at 13:40, Mariusz Gronczewski <mariusz.gronczewski@xxxxxxxxxxxx> wrote:
>>>
>>> On Mon, 7 Sep 2015 13:02:38 +0200, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>>>
>>>> Apart from a bug causing this, it could be caused by a failure of other OSDs (even a temporary one) that starts backfills:
>>>>
>>>> 1) something fails
>>>> 2) some PGs move to this OSD
>>>> 3) this OSD has to allocate memory for all the PGs
>>>> 4) whatever fails gets back up
>>>> 5) the memory is never released.
>>>>
>>>> A similar scenario is possible if, for example, someone confuses "ceph osd crush reweight" with "ceph osd reweight" (yes, this has happened to me :-)).
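For anyone hitting the same confusion: "ceph osd crush reweight" changes the CRUSH weight (roughly the device's capacity), while "ceph osd reweight" sets a temporary 0.0-1.0 override. A sketch of a guard for the override form (the helper name is made up; the ceph commands in the comments are the real ones):

```shell
# Cheat-sheet for the two easily-confused commands:
#   ceph osd crush reweight osd.<id> <weight>   # CRUSH weight (~capacity, e.g. TB)
#   ceph osd reweight <id> <0.0-1.0>            # temporary override weight
# A tiny guard that refuses an override weight outside [0,1]; it only
# prints the command it would run, rather than touching a cluster.
safe_reweight() {
  id=$1; w=$2
  awk -v w="$w" 'BEGIN { exit !(w >= 0 && w <= 1) }' || {
    echo "override weight must be in [0,1]" >&2; return 1
  }
  echo "would run: ceph osd reweight $id $w"
}
```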
>>>>
>>>> Did you try just restarting the OSD before you upgraded it?
>>>
>>> stopped, upgraded, started. It helped a bit (<3GB per OSD) but besides
>>> that nothing changed. I've tried waiting until it stops eating CPU and then
>>> restarting it, but it still eats >2GB of memory, which means I can't start
>>> all 4 OSDs at the same time ;/
>>>
>>> I've also added the noin, nobackfill and norecover flags but that didn't help
>>>
>>> it is surprising to me because before this, all 4 OSDs together ate less than
>>> 2GB of memory, so I thought I had enough headroom, and we did restart
>>> machines and removed/added OSDs to test that recovery/rebalance went fine
>>>
>>> it also does not have any external traffic at the moment
>>>
>>>
>>>>> On 07 Sep 2015, at 12:58, Mariusz Gronczewski <mariusz.gronczewski@xxxxxxxxxxxx> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> over the weekend (I was on vacation, so I don't know exactly what happened)
>>>>> our OSDs started eating in excess of 6GB of RAM (well, RSS), which was a
>>>>> problem considering that we have only 8GB of RAM for 4 OSDs (about 700
>>>>> PGs per OSD and about 70GB of space used). So a storm of coredumps and OOMs
>>>>> ground the OSDs down to unusability.
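Rough arithmetic on those numbers: 6GB of RSS over ~700 PGs is almost 9MB per PG, far more than these OSDs used before the spike, which points at accumulation rather than normal per-PG state:

```shell
# 6 GB RSS spread over ~700 PGs, in MB per PG
awk 'BEGIN { printf "%.1f MB/PG\n", 6 * 1024 / 700 }'   # prints "8.8 MB/PG"
```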
>>>>>
>>>>> I then upgraded one of the OSDs to hammer, which made it a bit better (~2GB
>>>>> per OSD) but still much higher usage than before.
>>>>>
>>>>> any ideas what could be the reason for that? The logs are mostly full of
>>>>> OSDs trying to recover and timed-out heartbeats
>>>>>
>>>>> --
>>>>> Mariusz Gronczewski, Administrator
>>>>>
>>>>> Efigence S. A.
>>>>> ul. Wołoska 9a, 02-583 Warszawa
>>>>> T: [+48] 22 380 13 13
>>>>> F: [+48] 22 380 13 14
>>>>> E: mariusz.gronczewski@xxxxxxxxxxxx
>>>>> <mailto:mariusz.gronczewski@xxxxxxxxxxxx>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>
>>>
>>>
>>> --
>>> Mariusz Gronczewski, Administrator
>>>
>>> Efigence S. A.
>>> ul. Wołoska 9a, 02-583 Warszawa
>>> T: [+48] 22 380 13 13
>>> F: [+48] 22 380 13 14
>>> E: mariusz.gronczewski@xxxxxxxxxxxx
>>> <mailto:mariusz.gronczewski@xxxxxxxxxxxx>
>>
>
>
>
> --
> Mariusz Gronczewski, Administrator
>
> Efigence S. A.
> ul. Wołoska 9a, 02-583 Warszawa
> T: [+48] 22 380 13 13
> F: [+48] 22 380 13 14
> E: mariusz.gronczewski@xxxxxxxxxxxx
> <mailto:mariusz.gronczewski@xxxxxxxxxxxx>




--
Regards,
xinze
