Maybe some configuration change occurred that now takes effect when you start the OSD? I'm not sure what could affect memory usage, though - some ulimit values maybe (stack size), the number of OSD threads (compare this OSD's count to the rest of the OSDs), fd cache size. Look in /proc and compare everything. Also look at "ceph osd tree" - didn't someone touch it while you were gone?

Jan

> On 07 Sep 2015, at 13:40, Mariusz Gronczewski <mariusz.gronczewski@xxxxxxxxxxxx> wrote:
>
> On Mon, 7 Sep 2015 13:02:38 +0200, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>
>> Apart from a bug causing this, it could be caused by a failure of other OSDs (even a temporary one) that starts backfills:
>>
>> 1) something fails
>> 2) some PGs move to this OSD
>> 3) this OSD has to allocate memory for all the PGs
>> 4) whatever failed comes back up
>> 5) the memory is never released
>>
>> A similar scenario is possible if, for example, someone confuses "ceph osd crush reweight" with "ceph osd reweight" (yes, this happened to me :-)).
>>
>> Did you try just restarting the OSD before you upgraded it?
>
> Stopped, upgraded, started. It helped a bit (<3GB per OSD), but besides
> that nothing changed. I've tried waiting until it stops eating CPU and then
> restarting it, but it still eats >2GB of memory, which means I can't start
> all 4 OSDs at the same time ;/
>
> I've also added the noin, nobackfill and norecover flags, but that didn't help.
>
> It is surprising to me because before this, all 4 OSDs together ate less than
> 2GB of memory, so I thought I had enough headroom, and we did restart
> machines and remove/add OSDs to test that recovery/rebalance goes fine.
>
> It also doesn't have any external traffic at the moment.
>
>>> On 07 Sep 2015, at 12:58, Mariusz Gronczewski <mariusz.gronczewski@xxxxxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> Over the weekend (I was on vacation, so I didn't catch exactly what happened)
>>> our OSDs started eating in excess of 6GB of RAM (RSS, that is), which was a
>>> problem considering that we had only 8GB of RAM for 4 OSDs (about 700
>>> PGs per OSD and about 70GB of space used). The resulting spam of coredumps
>>> and OOMs dragged the OSDs down to unusability.
>>>
>>> I then upgraded one of the OSDs to Hammer, which made it a bit better (~2GB
>>> per OSD), but usage is still much higher than before.
>>>
>>> Any ideas what the reason for that could be? The logs are mostly full of
>>> OSDs trying to recover and timed-out heartbeats.
>>>
>>> --
>>> Mariusz Gronczewski, Administrator
>>>
>>> Efigence S. A.
>>> ul. Wołoska 9a, 02-583 Warszawa
>>> T: [+48] 22 380 13 13
>>> F: [+48] 22 380 13 14
>>> E: mariusz.gronczewski@xxxxxxxxxxxx
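
For the /proc comparison suggested at the top of the thread, here is a minimal sketch that collects the values worth diffing between the misbehaving OSD and a healthy one (assumes root access and daemons named ceph-osd; adjust for your setup):

    # Per ceph-osd process: relevant ulimits, thread count, RSS and open fd count
    for pid in $(pgrep ceph-osd); do
        echo "=== ceph-osd pid $pid ==="
        grep -E 'Max (stack size|open files|processes)' /proc/$pid/limits
        grep -E '^(Threads|VmRSS)' /proc/$pid/status
        echo "open fds: $(ls /proc/$pid/fd | wc -l)"
    done

Run it on the affected host and on a healthy one and diff the output; a stack-size or thread-count difference would point at the kind of configuration drift described above.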
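
Since the pattern described upthread is memory allocated during recovery that is never given back, it is also cheap to check whether the allocator is simply holding on to freed pages. On builds linked against tcmalloc (the default in most packages), the OSDs expose heap commands for this - a sketch, with osd.0 standing in for the affected OSD:

    # Show tcmalloc heap statistics for one OSD...
    ceph tell osd.0 heap stats
    # ...and ask it to return unused freed pages to the OS
    ceph tell osd.0 heap release

If "heap stats" shows a large "freelist" held by the allocator while RSS stays high, "heap release" can shrink RSS without a restart.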
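
As for the "ceph osd crush reweight" vs "ceph osd reweight" mix-up mentioned above, the difference in a nutshell (the weights below are made-up examples, not recommendations):

    # Changes the CRUSH weight: the long-term placement weight, usually sized to the disk's capacity in TB
    ceph osd crush reweight osd.4 1.82
    # Changes the temporary 0..1 override weight (the "reweight" column in "ceph osd tree")
    ceph osd reweight 4 0.8

Confusing the two moves far more data than intended, which triggers exactly the backfill-driven memory growth described in steps 1-5 above.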