On Mon, 7 Sep 2015 13:44:55 +0200, Jan Schermer <jan@xxxxxxxxxxx> wrote:
> Maybe some configuration change occurred that now takes effect when you start the OSD?
> Not sure what could affect memory usage, though - maybe some ulimit values (stack size), the number of OSD threads (compare the number from this OSD to the rest of the OSDs), or the fd cache size. Look in /proc and compare everything.
> Also look at "ceph osd tree" - didn't someone touch it while you were gone?
>
> Jan
>
> number of OSD threads (compare the number from this OSD to the rest of the OSDs),

it occurred on all OSDs, and it looked like this: http://imgur.com/IIMIyRG
Sadly I was on vacation, so I didn't manage to catch it before ;/ but I'm sure there was no config change.

> > On 07 Sep 2015, at 13:40, Mariusz Gronczewski <mariusz.gronczewski@xxxxxxxxxxxx> wrote:
> >
> > On Mon, 7 Sep 2015 13:02:38 +0200, Jan Schermer <jan@xxxxxxxxxxx> wrote:
> >
> >> Apart from a bug causing this, it could be caused by a failure of other OSDs (even a temporary one) that starts backfills:
> >>
> >> 1) something fails
> >> 2) some PGs move to this OSD
> >> 3) this OSD has to allocate memory for all the PGs
> >> 4) whatever failed gets back up
> >> 5) the memory is never released
> >>
> >> A similar scenario is possible if, for example, someone confuses "ceph osd crush reweight" with "ceph osd reweight" (yes, this happened to me :-)).
> >>
> >> Did you try just restarting the OSD before you upgraded it?
> >
> > stopped, upgraded, started. It helped a bit (<3 GB per OSD), but besides that nothing changed.
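[Jan's advice to "look in /proc and compare everything" can be scripted. A minimal sketch, assuming the daemons show up as `ceph-osd` in the process list (adjust the name if your distribution differs); reading another user's `/proc/<pid>/fd` needs root:]

```shell
# For every running ceph-osd, dump the values worth comparing across OSDs:
# thread count, resident memory, inherited ulimits, and open file descriptors.
for pid in $(pgrep -x ceph-osd); do
    echo "=== PID $pid: $(tr '\0' ' ' < "/proc/$pid/cmdline") ==="
    grep -E '^(Threads|VmRSS)' "/proc/$pid/status"              # thread count and RSS
    grep -E 'stack size|open files' "/proc/$pid/limits"         # ulimit values the daemon inherited
    echo "open fds: $(ls "/proc/$pid/fd" 2>/dev/null | wc -l)"  # fd cache usage
done
```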
> > I've tried to wait till it stops eating CPU and then restart it, but it still eats >2 GB of memory, which means I can't start all 4 OSDs at the same time ;/
> >
> > I've also added the noin, nobackfill and norecover flags, but that didn't help.
> >
> > It is surprising to me, because before this all 4 OSDs together ate less than 2 GB of memory, so I thought I had enough headroom, and we did restart machines and removed/added OSDs to test that recovery/rebalance goes fine.
> >
> > It also does not have any external traffic at the moment.
> >
> >>> On 07 Sep 2015, at 12:58, Mariusz Gronczewski <mariusz.gronczewski@xxxxxxxxxxxx> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Over the weekend (I was on vacation, so I didn't catch exactly what happened) our OSDs started eating in excess of 6 GB of RAM (well, RSS), which was a problem considering that we had only 8 GB of RAM for 4 OSDs (about 700 PGs per OSD and about 70 GB of space used). So a spam of coredumps and OOMs knocked the OSDs down to unusability.
> >>>
> >>> I then upgraded one of the OSDs to Hammer, which made it a bit better (~2 GB per OSD), but still much higher usage than before.
> >>>
> >>> Any ideas what would be the reason for that? The logs are mostly full of OSDs trying to recover and timed-out heartbeats.
> >>>
> >>> --
> >>> Mariusz Gronczewski, Administrator
> >>>
> >>> Efigence S. A.
> >>> ul. Wołoska 9a, 02-583 Warszawa
> >>> T: [+48] 22 380 13 13
> >>> F: [+48] 22 380 13 14
> >>> E: mariusz.gronczewski@xxxxxxxxxxxx
> >>> _______________________________________________
> >>> ceph-users mailing list
> >>> ceph-users@xxxxxxxxxxxxxx
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
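[For reference, the noin/nobackfill/norecover flags mentioned above are set and cleared with the standard ceph CLI, run from any node with admin credentials. They are cluster-wide, so remember to unset them afterwards:]

```shell
# Pause data movement while investigating memory usage.
ceph osd set noin        # restarted OSDs won't be marked "in" automatically
ceph osd set nobackfill  # suspend backfill
ceph osd set norecover   # suspend recovery

# ... restart OSDs one at a time and watch their RSS ...

# Re-enable normal operation once done.
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset noin
```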
--
Mariusz Gronczewski, Administrator

Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczewski@xxxxxxxxxxxx