Re: Bluestore + erasure coding memory usage

Hi Lucas!

On 11/02/2016 09:07 PM, bobobo1618@xxxxxxxxx wrote:
I'm running Kraken built from Git and I've found that my OSDs eat as much
memory as they can until they're killed by the OOM killer. I understand that
Bluestore is experimental, but I thought this behavior should be known.

My setup:
- Xeon D-1540, 32GB DDR4 ECC RAM
- Arch Linux
- Single node, 4x 8TB OSDs, each prepared with "ceph-disk prepare
--bluestore /dev/sdX"
- Built from Git fac6335a1eea12270f76cf2c7814648669e6515a

Steps to reproduce:
- Start mon
- Start OSDs
- ceph osd pool create pool 256 256 erasure myprofile storage
- rados bench -p pool <time> write -t 32
- ceph osd pool delete pool
- ceph osd pool create pool 256 256 replicated
- rados bench -p pool <time> write -t 32
- ceph osd pool delete pool
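
("myprofile" is just an erasure-code profile created beforehand; the exact
parameters probably aren't important here, but for illustration, something
along these lines with an osd-level failure domain since everything is on a
single node:

ceph osd erasure-code-profile set myprofile k=2 m=1 ruleset-failure-domain=osd
ceph osd erasure-code-profile get myprofile
)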

The OSDs start at ~500M used each (according to "ceph tell osd.0 heap
stats"), before they're allocated PGs. After creating and peering PGs,
they're at ~514M each.

After running rados bench for 10s, memory is at ~727M each. Running
pprof on a dump shows the top entry as:

218.9  96.1%  96.1%    218.9  96.1% ceph::buffer::create_aligned

Running rados bench for another 10s pushes memory to ~836M each. pprof again
shows similar results:

305.2  96.8%  96.8%    305.2  96.8% ceph::buffer::create_aligned
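
(In case anyone wants to reproduce the pprof step, the dumps can be taken with
tcmalloc's heap profiler via the admin commands, roughly:

ceph tell osd.0 heap start_profiler
ceph tell osd.0 heap dump
pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.0.profile.0001.heap

where the profile filename and the ceph-osd binary path will vary with your
install.)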

I can continue this process until the OSDs are killed by OOM.

This only happens with Bluestore; other backends (like filestore) work fine.

When I delete the pool, the OSDs release the memory and return to their
~500M resting point.

Repeating the test with a replicated pool results in the OSDs consuming
elevated memory (~610M peak) while writing but returning to resting
levels when writing ends.

It'd be great if I could do something about this myself, but I don't
understand the code very well, and I can't figure out whether there's a way to
trace the allocation path for memory the way there is for CPU usage.

Any advice or solution would be much appreciated.

If you have the time and/or inclination, could you try to repeat this with one of the OSDs running under valgrind's massif tool? Basically, once the cluster is running as normal, take one of the OSDs, make a note of the params used to start it, then kill it and restart it like so:

valgrind --tool=massif --soname-synonyms=somalloc=*tcmalloc* --massif-out-file=<out file> --log-file=<log file> ceph-osd <osd params>

Once you've run your test and memory usage is up (it might take a while), gently kill the OSD with SIGTERM.

Then you can view the output data with ms_print or with massif-visualizer. This may help narrow down where in the code we are using the memory.
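
For example, assuming the massif out file from the command above:

ms_print <out file> | less
massif-visualizer <out file>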

Mark



Thanks!

Lucas


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



