Hi Lucas! On 11/02/2016 09:07 PM, bobobo1618@xxxxxxxxx wrote:
I'm running Kraken built from Git right now and I've found that my OSDs eat as much memory as they can before they're killed by OOM. I understand that Bluestore is experimental, but I thought the fact that it does this should be known.

My setup:
- Xeon D-1540, 32GB DDR4 ECC RAM
- Arch Linux
- Single node, 4x 8TB OSDs, each prepared with "ceph-disk prepare --bluestore /dev/sdX"
- Built from Git fac6335a1eea12270f76cf2c7814648669e6515a

Steps to reproduce:
- Start mon
- Start OSDs
- ceph osd pool create pool 256 256 erasure myprofile storage
- rados bench -p pool <time> write -t 32
- ceph osd pool delete pool
- ceph osd pool create pool 256 256 replicated
- rados bench -p pool <time> write -t 32
- ceph osd pool delete pool

The OSDs start at ~500M used each (according to "ceph tell osd.0 heap stats") before they're allocated PGs. After creating and peering PGs, they're at ~514M each. After running rados bench for 10s, memory is at ~727M each. Running pprof on a dump shows the top entry as:

  218.9  96.1%  96.1%  218.9  96.1%  ceph::buffer::create_aligned

Running rados bench for another 10s pushes memory to 836M each; pprof again shows similar results:

  305.2  96.8%  96.8%  305.2  96.8%  ceph::buffer::create_aligned

I can continue this process until the OSDs are killed by OOM. This only happens with Bluestore; other backends (like filestore) work fine. When I delete the pool, the OSDs release the memory and return to their ~500M resting point. Repeating the test with a replicated pool results in the OSDs consuming elevated memory (~610M peak) while writing, but returning to resting levels when writing ends.

It'd be great if I could do something about this myself, but I don't understand the code very well and I can't figure out if there's a way to trace the path taken for the memory to be allocated, like there is for CPU usage. Any advice or solution would be much appreciated.
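For what it's worth, heap dumps like the ones you quoted typically come from tcmalloc's built-in profiler, which you can drive through "ceph tell"; roughly like the sketch below. The file names, paths and the pprof binary name are just what a default install would use, so they may differ on Arch — adjust as needed.

# start tcmalloc's heap profiler on osd.0 and take a dump
ceph tell osd.0 heap start_profiler
ceph tell osd.0 heap dump

# the dump lands in the OSD's log directory; the pprof tool may be
# packaged as "google-pprof" depending on the distro
pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.0.profile.0001.heap

# stop profiling when done
ceph tell osd.0 heap stop_profiler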
If you have the time and/or inclination, could you see if you could repeat this with one of the OSDs running under valgrind with the massif tool? Basically, once the cluster is running as normal, take one of the OSDs, make note of the params used to start it up, then kill it and restart the OSD like so:
valgrind --tool=massif --soname-synonyms=somalloc=*tcmalloc* --massif-out-file=<out file> --log-file=<log file> ceph-osd <osd params>
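For example, if the OSD in question is osd.0 in a default-named cluster, the invocation might look something like the sketch below. The -i/--cluster/-f flags are just a guess at typical params — substitute whatever your OSDs are actually started with, and put the out/log files wherever is convenient.

valgrind --tool=massif --soname-synonyms=somalloc=*tcmalloc* \
    --massif-out-file=/tmp/osd.0.massif --log-file=/tmp/osd.0.valgrind.log \
    ceph-osd -i 0 --cluster ceph -f
# -f keeps ceph-osd in the foreground so it stays under valgrind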
Once you've run your test for a while and memory usage is up (it might take a while), gently kill it with SIGTERM.
Then you can view the output data with ms_print or with massif-visualizer. This may help narrow down where in the code we are using the memory.
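For example, with the (hypothetical) out file from the command above:

ms_print /tmp/osd.0.massif | less
# or, with a GUI:
massif-visualizer /tmp/osd.0.massif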
Mark
Thanks! Lucas