OOM on OSDs with erasure coding

Hi All,

We are testing an erasure coded ceph cluster fronted by rados gateway.
Recently many osds are going down due to out-of-memory.
Here are the details.

Description of the cluster:
====================
ceph version 0.94.2 (hammer)
32 hosts, 6 disks (OSDs) per host, so 32*6 = 192 OSDs
17024 PGs, 15 pools, 107 TB data, 57616 kobjects
167 TB used, 508 TB / 675 TB available
erasure coding: Reed-Solomon (reed_sol_van) with k=10, m=5
We are using RGW as the client. Only the .rgw.buckets pool is erasure
coded; the rest of the RGW metadata/index pools are replicated with size=3.
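As a quick sanity check on the numbers above, here is the raw-space
overhead implied by k=10, m=5 (a rough Python illustration; attributing
the leftover raw space to the replicated pools is only our assumption):

    # Raw-space overhead implied by the EC profile k=10, m=5.
    k, m = 10, 5
    overhead = (k + m) / float(k)     # 1.5x raw bytes per logical byte

    data_tb = 107.0                   # logical data reported by the cluster
    raw_used_tb = 167.0               # raw space reported as used

    print("EC overhead factor: %.2fx" % overhead)
    print("expected raw usage if all data were EC: %.1f TB" % (data_tb * overhead))
    # ~160.5 TB; the remaining ~6.5 TB of the 167 TB used is plausibly the
    # size=3 replicated RGW metadata/index pools plus filesystem overhead.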

The Problem
==========
We ran a load test against this cluster. The load test simply writes
4.5 MB objects through a locust test cluster.
We observed very low throughput, with the disks saturated on IOPS.
We reasoned that this is because the rgw stripe width is 4 MB,
which results in the OSDs splitting each stripe into 4 MB / k = ~400 KB
chunks, leading to random-IO behaviour on the disks.

To mitigate this, we changed the rgw stripe width to 40 MB (so that,
after chunking, the per-OSD chunk size becomes 40 MB / k = 4 MB) and we
modified the load test to upload 40 MB objects.
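For reference, here is that chunk-size arithmetic spelled out as a small
Python sketch (the helper function is just for illustration, not anything
from the rgw/osd code):

    # Per-OSD chunk size for one RGW stripe under EC with k data chunks.
    def chunk_size_kb(stripe_width_mb, k):
        return stripe_width_mb * 1024.0 / k

    k = 10
    # Default 4 MB stripes -> ~410 KB writes per data OSD (seek-heavy).
    print("4 MB stripe  -> %.1f KB per OSD" % chunk_size_kb(4, k))
    # 40 MB stripes -> 4096 KB = 4 MB writes per data OSD (closer to sequential).
    print("40 MB stripe -> %.1f KB per OSD" % chunk_size_kb(40, k))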

Now we observed a more serious problem.
A lot of OSDs across different hosts started getting killed by the OOM
killer. The memory usage of the OSDs was huge: ~10 GB per OSD.
For comparison, we have a different replicated cluster with a lot more
data where OSD memory usage is ~600 MB.

At this point we stopped the load test and tried to restart the
individual OSDs.
Even without load, the OSD memory size grows to ~11 GB.

We ran the tcmalloc heap profiler against one of the OSDs. Here is the
graph generated by google-pprof:
http://s33.postimg.org/5w48sr3an/mygif.gif


The graph seems to indicate that most of the memory is being allocated
by PGLog::readLog.
Is this expected behaviour? Is there a setting that allows us to
control this?
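For a rough sense of scale, here is the back-of-envelope estimate we
made. If we understand correctly, the number of in-memory log entries
per PG is bounded by osd_min_pg_log_entries / osd_max_pg_log_entries;
everything marked as an assumption below (the per-entry size, the
3000-entry default, treating all PGs as EC) is a guess on our part, not
a measured value:

    # Back-of-envelope for PG log memory on one OSD (all inputs are
    # estimates/assumptions, not measurements).
    total_pgs = 17024           # from the cluster status above
    ec_shards = 10 + 5          # k + m shards per EC PG
    osds = 192

    # Assume (as an upper bound) that every PG is an EC PG with k+m shards.
    pg_shards_per_osd = total_pgs * ec_shards / osds        # ~1330

    # Assumed default for osd_min_pg_log_entries (3000); the log can
    # grow further toward osd_max_pg_log_entries under load/recovery.
    log_entries_per_pg = 3000

    # Guessed average in-memory size of one log entry, including the
    # long RGW object names and container overhead.
    bytes_per_entry = 2500

    est = pg_shards_per_osd * log_entries_per_pg * bytes_per_entry
    print("estimated PG log memory per OSD: %.1f GB" % (est / 1e9))
    # ~10 GB with these guesses, which is in the same ballpark as what
    # we observe.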

Please let us know further steps we can take to fix the problem.

Thanks
Sharath