O.k., that's the protocol, 802.3ad.

----- Original Message -----
From: "Mariusz Gronczewski" <mariusz.gronczewski@xxxxxxxxxxxx>
To: "Shinobu Kinjo" <skinjo@xxxxxxxxxx>
Cc: "Jan Schermer" <jan@xxxxxxxxxxx>, ceph-users@xxxxxxxxxxxxxx
Sent: Monday, September 7, 2015 10:19:23 PM
Subject: Re: Huge memory usage spike in OSD on hammer/giant

yes

On Mon, 7 Sep 2015 09:15:55 -0400 (EDT), Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:

> > master/slave
>
> Meaning that you are using bonding?
>
> ----- Original Message -----
> From: "Mariusz Gronczewski" <mariusz.gronczewski@xxxxxxxxxxxx>
> To: "Shinobu Kinjo" <skinjo@xxxxxxxxxx>
> Cc: "Jan Schermer" <jan@xxxxxxxxxxx>, ceph-users@xxxxxxxxxxxxxx
> Sent: Monday, September 7, 2015 10:05:23 PM
> Subject: Re: Huge memory usage spike in OSD on hammer/giant
>
> nope, master/slave, that's why the graph only shows traffic on eth2
>
> On Mon, 7 Sep 2015 09:01:53 -0400 (EDT), Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:
>
> > Are you using LACP on the 10G interfaces?
> >
> > ----- Original Message -----
> > From: "Mariusz Gronczewski" <mariusz.gronczewski@xxxxxxxxxxxx>
> > To: "Shinobu Kinjo" <skinjo@xxxxxxxxxx>
> > Cc: "Jan Schermer" <jan@xxxxxxxxxxx>, ceph-users@xxxxxxxxxxxxxx
> > Sent: Monday, September 7, 2015 9:58:33 PM
> > Subject: Re: Huge memory usage spike in OSD on hammer/giant
> >
> > that was on the 10Gbit interface between OSDs, not from outside; traffic
> > from outside was rather low (eth0/1 public, eth2/3 cluster)
> >
> > On Mon, 7 Sep 2015 08:42:09 -0400 (EDT), Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:
> >
> > > How heavy was the network traffic?
> > >
> > > Have you tried capturing the traffic between the cluster and public networks
> > > to see where all of that traffic came from?
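The bonding-mode question above can be answered straight from the kernel's bonding status file. A minimal sketch — the bond name `bond0` is an assumption, and what the thread calls "master/slave" is what the bonding driver reports as active-backup:

```shell
# Print the bonding mode and the currently active slave from the
# kernel's bonding status file (bond name "bond0" is an assumption).
show_bond_mode() {
    grep -E '^(Bonding Mode|Currently Active Slave):' "$1"
}
# Usage on a live host:
# show_bond_mode /proc/net/bonding/bond0
```

An active-backup bond prints "Bonding Mode: fault-tolerance (active-backup)" (with only the active slave carrying traffic, as on eth2 in the graph), while an 802.3ad bond prints "Bonding Mode: IEEE 802.3ad Dynamic link aggregation".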
> > >
> > > Shinobu
> > >
> > > ----- Original Message -----
> > > From: "Jan Schermer" <jan@xxxxxxxxxxx>
> > > To: "Mariusz Gronczewski" <mariusz.gronczewski@xxxxxxxxxxxx>
> > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > Sent: Monday, September 7, 2015 9:17:04 PM
> > > Subject: Re: Huge memory usage spike in OSD on hammer/giant
> > >
> > > Hmm, even network traffic went up.
> > > Nothing in the logs on the mons, which started 9/4 ~6 AM?
> > >
> > > Jan
> > >
> > > > On 07 Sep 2015, at 14:11, Mariusz Gronczewski <mariusz.gronczewski@xxxxxxxxxxxx> wrote:
> > > >
> > > > On Mon, 7 Sep 2015 13:44:55 +0200, Jan Schermer <jan@xxxxxxxxxxx> wrote:
> > > >
> > > >> Maybe some configuration change occurred that now takes effect when you start the OSD?
> > > >> Not sure what could affect memory usage though - some ulimit values maybe (stack size), the number of OSD threads (compare the count on this OSD with the rest of the OSDs), fd cache size. Look in /proc and compare everything.
> > > >> Also look at "ceph osd tree" - didn't someone touch it while you were gone?
> > > >>
> > > >> Jan
> > > >>
> > > >
> > > >> the number of OSD threads (compare the count on this OSD with the rest of the OSDs)
> > > >
> > > > it occurred on all OSDs, and it looked like this:
> > > > http://imgur.com/IIMIyRG
> > > >
> > > > sadly I was on vacation so I didn't manage to catch it in the act ;/ but I'm
> > > > sure there was no config change
> > > >
> > > >
> > > >>> On 07 Sep 2015, at 13:40, Mariusz Gronczewski <mariusz.gronczewski@xxxxxxxxxxxx> wrote:
> > > >>>
> > > >>> On Mon, 7 Sep 2015 13:02:38 +0200, Jan Schermer <jan@xxxxxxxxxxx> wrote:
> > > >>>
> > > >>>> Apart from a bug causing this, it could be caused by a failure of other OSDs (even a temporary one) that starts backfills:
> > > >>>>
> > > >>>> 1) something fails
> > > >>>> 2) some PGs move to this OSD
> > > >>>> 3) this OSD has to allocate memory for all the PGs
> > > >>>> 4) whatever failed comes back up
> > > >>>> 5) the memory is never released.
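Jan's "look in /proc and compare everything" advice can be scripted; a rough sketch that pulls the thread count and resident memory for every ceph-osd process on a host, using the standard `Threads:` and `VmRSS:` keys of `/proc/<pid>/status`:

```shell
# Report thread count and resident memory for one process, read from
# its /proc/<pid>/status file (Threads: and VmRSS: are standard keys).
osd_mem_stats() {
    awk '/^Threads:/ {t=$2} /^VmRSS:/ {r=$2} END {print "threads=" t " rss_kb=" r}' "$1"
}
# Compare all ceph-osd daemons on one host:
for pid in $(pgrep ceph-osd || true); do
    echo "osd pid $pid: $(osd_mem_stats "/proc/$pid/status")"
done
```

An OSD with a wildly different thread count or RSS from its peers is the one to look at first.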
> > > >>>>
> > > >>>> A similar scenario is possible if, for example, someone confuses "ceph osd crush reweight" with "ceph osd reweight" (yes, this happened to me :-)).
> > > >>>>
> > > >>>> Did you try just restarting the OSD before you upgraded it?
> > > >>>
> > > >>> stopped, upgraded, started. It helped a bit (<3GB per OSD) but beside
> > > >>> that nothing changed. I've tried waiting until it stops eating CPU and then
> > > >>> restarting it, but it still eats >2GB of memory, which means I can't start
> > > >>> all 4 OSDs at the same time ;/
> > > >>>
> > > >>> I've also set the noin, nobackfill and norecover flags but that didn't help
> > > >>>
> > > >>> it is surprising to me because before this all 4 OSDs together ate less than
> > > >>> 2GB of memory, so I thought I had enough headroom, and we did restart
> > > >>> machines and removed/added OSDs to test that recovery/rebalance goes fine
> > > >>>
> > > >>> it also does not have any external traffic at the moment
> > > >>>
> > > >>>
> > > >>>>> On 07 Sep 2015, at 12:58, Mariusz Gronczewski <mariusz.gronczewski@xxxxxxxxxxxx> wrote:
> > > >>>>>
> > > >>>>> Hi,
> > > >>>>>
> > > >>>>> over the weekend (I was on vacation so I didn't catch exactly what happened)
> > > >>>>> our OSDs started eating in excess of 6GB of RAM (RSS), which was a
> > > >>>>> problem considering that we had only 8GB of RAM for 4 OSDs (about 700
> > > >>>>> PGs per OSD and about 70GB of space used). So a spam of coredumps and OOMs
> > > >>>>> ground the OSDs down to unusability.
> > > >>>>>
> > > >>>>> I then upgraded one of the OSDs to hammer, which made it a bit better (~2GB
> > > >>>>> per OSD) but still much higher usage than before.
> > > >>>>>
> > > >>>>> any ideas what could be the reason for that? the logs are mostly full of
> > > >>>>> OSDs trying to recover and timed-out heartbeats
> > > >>>>>
> > > >>>>> --
> > > >>>>> Mariusz Gronczewski, Administrator
> > > >>>>>
> > > >>>>> Efigence S. A.
> > > >>>>> ul. Wołoska 9a, 02-583 Warszawa
> > > >>>>> T: [+48] 22 380 13 13
> > > >>>>> F: [+48] 22 380 13 14
> > > >>>>> E: mariusz.gronczewski@xxxxxxxxxxxx
> > > >>>>>
> > > >>>>> _______________________________________________
> > > >>>>> ceph-users mailing list
> > > >>>>> ceph-users@xxxxxxxxxxxxxx
> > > >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
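The recovery-throttling flags mentioned in the thread (noin, nobackfill, norecover) are set and cleared cluster-wide with the ceph CLI. A sketch of the usual pattern for restarting OSDs without triggering data movement (to be run against a live cluster with an admin keyring):

```shell
# Flags that keep the cluster from marking restarted OSDs "in" and
# from starting backfill/recovery while OSDs are cycled one at a time.
flags="noin nobackfill norecover"
set_flags()   { for f in $flags; do ceph osd set "$f"; done; }
unset_flags() { for f in $flags; do ceph osd unset "$f"; done; }
# set_flags
# ...restart OSDs one by one, watching memory...
# unset_flags
```

Note that, as reported above, these flags stop data movement but do not by themselves shrink the memory already held by PGs the OSD has mapped.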