What distro are you running on?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Apr 14, 2014 at 5:28 AM, David McBride <dwm37@xxxxxxxxx> wrote:
> Hello,
>
> I'm currently experimenting with a Ceph deployment, and am noticing that
> some of my machines are having processes killed by the OOM killer,
> despite provisioning 32 GB of RAM for a 12-OSD machine.
>
> (This tended to correlate with reshaping the cluster, which is not
> surprising, given that OSD memory utilization is documented to spike
> while recovery operations are in progress.)
>
> While the recently-added zram kernel facility appears to be helping
> somewhat in stretching the available resources, I've been reviewing the
> heap utilization statistics reported by `ceph tell osd.$i heap stats`.
>
> On a representative process, I see:
>
>> osd.0tcmalloc heap stats:------------------------------------------------
>> MALLOC:     593850280 (  566.3 MiB) Bytes in use by application
>> MALLOC: +  1621073920 ( 1546.0 MiB) Bytes in page heap freelist
>> MALLOC: +   117159712 (  111.7 MiB) Bytes in central cache freelist
>> MALLOC: +     2987008 (    2.8 MiB) Bytes in transfer cache freelist
>> MALLOC: +    84780344 (   80.9 MiB) Bytes in thread cache freelists
>> MALLOC: +    13119640 (   12.5 MiB) Bytes in malloc metadata
>> MALLOC:   ------------
>> MALLOC: =  2432970904 ( 2320.3 MiB) Actual memory used (physical + swap)
>> MALLOC: +    44449792 (   42.4 MiB) Bytes released to OS (aka unmapped)
>> MALLOC:   ------------
>> MALLOC: =  2477420696 ( 2362.7 MiB) Virtual address space used
>> MALLOC:
>> MALLOC:         60887              Spans in use
>> MALLOC:           775              Thread heaps in use
>> MALLOC:          8192              Tcmalloc page size
>> ------------------------------------------------
>
> I noticed there's a huge amount of memory (1.5 GB) sitting on the page
> heap freelist. As an experiment, I ran `ceph tell osd.$i heap release`,
> and the amount of memory in use dropped substantially:
>
>> osd.0tcmalloc heap stats:------------------------------------------------
>> MALLOC:     581434648 (  554.5 MiB) Bytes in use by application
>> MALLOC: +    11509760 (   11.0 MiB) Bytes in page heap freelist
>> MALLOC: +   105904144 (  101.0 MiB) Bytes in central cache freelist
>> MALLOC: +     2070848 (    2.0 MiB) Bytes in transfer cache freelist
>> MALLOC: +    97882520 (   93.3 MiB) Bytes in thread cache freelists
>> MALLOC: +    13119640 (   12.5 MiB) Bytes in malloc metadata
>> MALLOC:   ------------
>> MALLOC: =   811921560 (  774.3 MiB) Actual memory used (physical + swap)
>> MALLOC: +  1665499136 ( 1588.3 MiB) Bytes released to OS (aka unmapped)
>> MALLOC:   ------------
>> MALLOC: =  2477420696 ( 2362.7 MiB) Virtual address space used
>> MALLOC:
>> MALLOC:         60733              Spans in use
>> MALLOC:           803              Thread heaps in use
>> MALLOC:          8192              Tcmalloc page size
>> ------------------------------------------------
>
> This was consistent across all 12 OSDs; running this command on all the
> OSDs on a machine (sketched below) dropped memory utilization by ~15 GB,
> or ~50% of the RAM in my machine.
>
> Is this expected behaviour? Would it be prudent to treat this as the
> amount of memory the Ceph OSDs genuinely require at peak demand?
> (If so, that indicates I need to look at increasing the spec of my
> storage nodes...)
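>
> For reference, a minimal sketch of the per-host release described above
> (using OSD IDs 0-11 purely as an example; substitute whichever OSDs are
> local to the host):
>
>     # ask tcmalloc in each local OSD to return its freelist pages to the OS
>     for i in $(seq 0 11); do
>         ceph tell osd.$i heap release
>     done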
>
> I see similar results on my MON nodes.
>
> Before a release:
>
>> mon.ceph-sm000tcmalloc heap stats:------------------------------------------------
>> MALLOC:     599497240 (  571.7 MiB) Bytes in use by application
>> MALLOC: +   806297600 (  768.9 MiB) Bytes in page heap freelist
>> MALLOC: +    32448368 (   30.9 MiB) Bytes in central cache freelist
>> MALLOC: +     1684080 (    1.6 MiB) Bytes in transfer cache freelist
>> MALLOC: +    23270408 (   22.2 MiB) Bytes in thread cache freelists
>> MALLOC: +     5091480 (    4.9 MiB) Bytes in malloc metadata
>> MALLOC:   ------------
>> MALLOC: =  1468289176 ( 1400.3 MiB) Actual memory used (physical + swap)
>> MALLOC: +    30859264 (   29.4 MiB) Bytes released to OS (aka unmapped)
>> MALLOC:   ------------
>> MALLOC: =  1499148440 ( 1429.7 MiB) Virtual address space used
>> MALLOC:
>> MALLOC:         18309              Spans in use
>> MALLOC:           122              Thread heaps in use
>> MALLOC:          8192              Tcmalloc page size
>> ------------------------------------------------
>
> After:
>
>> mon.ceph-sm000tcmalloc heap stats:------------------------------------------------
>> MALLOC:     600108520 (  572.3 MiB) Bytes in use by application
>> MALLOC: +    17342464 (   16.5 MiB) Bytes in page heap freelist
>> MALLOC: +    32392208 (   30.9 MiB) Bytes in central cache freelist
>> MALLOC: +      964240 (    0.9 MiB) Bytes in transfer cache freelist
>> MALLOC: +    23402360 (   22.3 MiB) Bytes in thread cache freelists
>> MALLOC: +     5091480 (    4.9 MiB) Bytes in malloc metadata
>> MALLOC:   ------------
>> MALLOC: =   679301272 (  647.8 MiB) Actual memory used (physical + swap)
>> MALLOC: +   819847168 (  781.9 MiB) Bytes released to OS (aka unmapped)
>> MALLOC:   ------------
>> MALLOC: =  1499148440 ( 1429.7 MiB) Virtual address space used
>> MALLOC:
>> MALLOC:         16396              Spans in use
>> MALLOC:           122              Thread heaps in use
>> MALLOC:          8192              Tcmalloc page size
>> ------------------------------------------------
>
> The tcmalloc documentation suggests that memory should gradually be
> returned to the operating system:
>
> http://gperftools.googlecode.com/svn/trunk/doc/tcmalloc.html#runtime
>
> Given that these OSDs were largely idle over the weekend prior to
> running this experiment, it seems clear that this process is not
> operating as designed.
>
> I've looked through the environment of my running processes and the
> Ceph source, and can see no reference to TCMALLOC_RELEASE_RATE or
> SetMemoryReleaseRate().
>
> I'm currently running an experiment in which I define
> "env TCMALLOC_RELEASE_RATE=10" in
> /etc/init/ceph-{osd,mon}.conf.override; I'll see whether this has any
> impact on memory usage over time.
>
> (I suspect that my current cluster's placement-group count is
> excessive; with 144 OSDs, I'm running about a dozen pools, each with
> ~8,000 PGs. It's not clear how the PG-sizing guidelines should be
> adjusted for multiple-pool configurations; at some point I'll see what
> effect wiping the cluster and using a much smaller per-pool PG count
> has.)
>
> Cheers,
> David
> --
> David McBride <dwm37@xxxxxxxxx>
> Unix Specialist, University Information Services
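
Regarding the PG counts quoted above: the rule of thumb commonly cited at
the time was on the order of 100 PG copies per OSD in total, across all
pools. As a rough back-of-the-envelope check (assuming 3x replication,
which isn't stated in the message):

    12 pools x ~8,000 PGs x 3 replicas ~= 288,000 PG copies
    288,000 PG copies / 144 OSDs       ~= 2,000 PG copies per OSD

That is roughly 20x the usual target, and each PG carries per-OSD memory
overhead, so a much smaller per-pool PG count would be expected to reduce
memory pressure considerably.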