OOM on OSDs with erasure coding

Hi All,

We are testing an erasure-coded pool fronted by the RADOS gateway (RGW).
Recently, many OSDs have been going down due to out-of-memory kills.
Here are the details.

Description of the cluster:
32 hosts, 6 disks (OSDs) per host, so 32*6 = 192 OSDs
17024 PGs, 15 pools, 107 TB data, 57616 kobjects
167 TB used, 508 TB / 675 TB available
Erasure coding: Reed-Solomon (reed_sol_van) with k=10, m=5
RGW is the only client. Only the .rgw.buckets pool is erasure coded;
the rest of the RGW metadata/index pools are replicated with size=3.
Number of virtual CPUs: 20
RAM: 53 GB
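
For reference, the erasure-code profile corresponds to something like the following (a minimal sketch using the Python rados bindings and the standard mon command schema; the profile name is a placeholder, not our exact command history):

    # Sketch: define an EC profile like ours (k=10, m=5, reed_sol_van).
    # "rgw-k10-m5" is a placeholder name.
    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    cmd = json.dumps({
        "prefix": "osd erasure-code-profile set",
        "name": "rgw-k10-m5",
        "profile": ["k=10", "m=5", "plugin=jerasure", "technique=reed_sol_van"],
    })
    ret, outbuf, outs = cluster.mon_command(cmd, b'')
    print(ret, outs)

    cluster.shutdown()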

The Problem
We ran a load test against this cluster. The load test simply writes 4.5 MB objects through a Locust test cluster.
We observed very low throughput, with the disks saturated on IOPS.
We reasoned that this is because the RGW stripe width is 4 MB,
so the EC pool splits each stripe into 4 MB / k = ~400 KB chunks, which leads to random-IO behaviour on the disks.
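
Back-of-the-envelope, per object (this ignores RGW head/tail object details, so treat it only as an order-of-magnitude sketch):

    # Writes generated by one 4.5 MB object with a 4 MB RGW stripe
    # on a k=10, m=5 pool.
    import math

    MB = 1024 * 1024
    object_size = 4.5 * MB
    stripe_width = 4 * MB
    k, m = 10, 5

    chunk_size = stripe_width / k                         # ~400 KB per chunk
    stripes = int(math.ceil(object_size / stripe_width))  # 2 stripes
    shard_writes = stripes * (k + m)                      # 30 small writes across OSDs

    print("%.0f KB per chunk, %d shard writes per object"
          % (chunk_size / 1024, shard_writes))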

To mitigate this, we changed the RGW stripe width to 40 MB (so that, after chunking, the chunk size becomes 40 MB / k = 4 MB) and modified the load test to upload 40 MB objects. We did all of this without recreating the cluster.
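
The write path of the load test boils down to roughly the following (sketched here with boto3; endpoint, credentials and bucket name are placeholders, and the real test is driven by Locust):

    # Sketch: PUT a 40 MB object through RGW, as the load test does.
    import os
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://rgw.example.com:7480",   # placeholder endpoint
        aws_access_key_id="ACCESS_KEY",               # placeholder credentials
        aws_secret_access_key="SECRET_KEY",
    )
    s3.create_bucket(Bucket="loadtest")
    s3.put_object(Bucket="loadtest", Key="obj-0",
                  Body=os.urandom(40 * 1024 * 1024))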

Now we observed a more serious problem.
A lot of OSDs across different hosts started getting killed by the Linux OOM killer.
We saw that the memory usage of the OSDs was huge: ~10 GB per OSD.
For comparison, we have a different, replicated cluster holding a lot more data where OSD memory usage is ~600 MB.

At this point we stopped the load test and tried to restart the individual OSDs.
Even without load, each restarted OSD's memory usage grows to ~11 GB.
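
One way to check per-OSD resident memory on a host is a small sketch like this (it just reads VmRSS from /proc for every ceph-osd process):

    # Sketch: sum resident memory (VmRSS) of every ceph-osd process on a host.
    import os

    def osd_rss_kb():
        rss = {}
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open("/proc/%s/comm" % pid) as f:
                    if f.read().strip() != "ceph-osd":
                        continue
                with open("/proc/%s/status" % pid) as f:
                    for line in f:
                        if line.startswith("VmRSS:"):
                            rss[pid] = int(line.split()[1])  # value is in kB
            except IOError:
                continue  # process exited while we were reading it
        return rss

    for pid, kb in sorted(osd_rss_kb().items()):
        print(pid, kb // 1024, "MB")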

We ran the tcmalloc heap profiler against one of the OSDs. Here is the graph generated by google-pprof:
http://s33.postimg.org/5w48sr3an/mygif.gif
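
For anyone who wants to reproduce this, the heap profiler can be driven through the OSD tell interface, roughly like this (osd.12 is a placeholder id; a sketch, not our exact script). google-pprof is then pointed at the dumped .heap files together with the ceph-osd binary:

    # Sketch: start the tcmalloc heap profiler on a running OSD, dump a
    # profile, and stop it again. Dumps typically land in the OSD's log
    # directory (e.g. /var/log/ceph/).
    import subprocess

    osd = "osd.12"  # placeholder OSD id
    print(subprocess.check_output(["ceph", "tell", osd, "heap", "stats"]).decode())
    subprocess.check_call(["ceph", "tell", osd, "heap", "start_profiler"])
    # ... let the OSD run for a while under load, then:
    subprocess.check_call(["ceph", "tell", osd, "heap", "dump"])
    subprocess.check_call(["ceph", "tell", osd, "heap", "stop_profiler"])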

Note that tcmalloc reports only ~4 GB of heap space, but there is an extra ~6 GB of anonymous memory mappings, visible through /proc/<osd pid>/maps, that we are not able to account for.
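
To try to account for it, the OSD's resident memory can be split into anonymous vs file-backed mappings from /proc/<pid>/smaps; a rough sketch:

    # Sketch: split an OSD's resident memory into anonymous vs file-backed
    # mappings by parsing /proc/<pid>/smaps, to compare with the heap size
    # that tcmalloc reports.
    import sys

    def rss_breakdown(pid):
        anon_kb = file_kb = 0
        is_file_backed = False
        with open("/proc/%s/smaps" % pid) as f:
            for line in f:
                fields = line.split()
                if not line[0].isupper():
                    # mapping header: "addr-addr perms offset dev inode [path]"
                    is_file_backed = len(fields) > 5 and fields[5].startswith("/")
                elif fields[0] == "Rss:":
                    if is_file_backed:
                        file_kb += int(fields[1])
                    else:
                        anon_kb += int(fields[1])
        return anon_kb, file_kb

    anon_kb, file_kb = rss_breakdown(sys.argv[1])
    print("anonymous: %d MB, file-backed: %d MB"
          % (anon_kb / 1024, file_kb / 1024))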

The graph seems to indicate that reading the LevelDB store is somehow allocating a lot of memory.
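
In case it helps to correlate, the on-disk size of each OSD's LevelDB (omap) store can be checked with something like this (a sketch that assumes the default FileStore layout under /var/lib/ceph/osd):

    # Sketch: report the on-disk size of each OSD's LevelDB omap store
    # on this host.
    import glob
    import os

    for omap_dir in sorted(glob.glob("/var/lib/ceph/osd/ceph-*/current/omap")):
        total = sum(
            os.path.getsize(os.path.join(omap_dir, name))
            for name in os.listdir(omap_dir)
        )
        print(omap_dir, total // (1024 * 1024), "MB")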

Please let us know what further steps we can take to identify and fix the problem.
Thanks

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
