Re: OOM on OSDs with erasure coding

On Thu, Jun 2, 2016 at 2:23 PM, Sharath Gururaj <sharath.g@xxxxxxxxxxxx> wrote:
> Hi All,
>
> We are testing an erasure coded ceph cluster fronted by rados gateway.
> Recently many osds are going down due to out-of-memory.
> Here are the details.
>
> Description of the cluster:
> ====================
> ceph version 0.94.2 (hammer)

Sharath, have you tried the latest hammer release (v0.94.7)? Does it
also have this issue?
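
For reference, the versions actually running can be checked with
something like the following (assuming the ceph CLI can reach the
cluster; "osd.*" fans the request out to every up OSD):

    # ask every running OSD which version it is built from
    ceph tell osd.* version
    # or query a single daemon through its admin socket on the host
    ceph daemon osd.0 version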

> 32 hosts, 6 disks (OSDs) per host, so 32*6 = 192 OSDs
> 17024 pgs, 15 pools, 107 TB data, 57616 kobjects
> 167 TB used, 508 TB / 675 TB available
> erasure coding Reed-Solomon (reed_sol_van) with k=10, m=5
> We are using rgw as the client. Only the .rgw.buckets pool is erasure
> coded; the rest of the rgw metadata/index pools are replicated with size=3
>
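For anyone wanting to reproduce a similar layout, this is roughly how
such a profile and pool would be created on hammer (the profile name,
pg counts, and failure domain below are illustrative assumptions, not
taken from Sharath's setup):

    # define a Reed-Solomon profile with 10 data + 5 coding chunks
    ceph osd erasure-code-profile set ec-k10-m5 \
        plugin=jerasure technique=reed_sol_van k=10 m=5 \
        ruleset-failure-domain=host
    # create the rgw data pool on top of that profile
    ceph osd pool create .rgw.buckets 4096 4096 erasure ec-k10-m5

Note that with k=10, m=5 every object touches 15 OSDs, so the effective
pg-per-osd load is much higher than the raw pg count suggests.
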
> The Problem
> ==========
> We ran a load test against this cluster. The load test simply writes
> 4.5 MB objects through a Locust test cluster.
> We observed very low throughput, with disk IOPS saturated.
> We reasoned that this is because the rgw stripe width is 4 MB,
> which results in the OSDs splitting each stripe into 4 MB / k = 400 KB
> chunks, which leads to random I/O behaviour.
>
> To mitigate this, we changed the rgw stripe width to 40 MB (so that,
> after chunking, the chunk sizes become 40 MB / k = 4 MB) and we
> modified the load test to upload 40 MB objects.
>
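For reference, the stripe width rgw uses is controlled by the
rgw_obj_stripe_size option in ceph.conf; a minimal sketch of the change
described above (the section name is an assumption and depends on how
the rgw instance is named):

    [client.radosgw.gateway]
    # default is 4194304 bytes (4 MB); raised so each EC chunk is ~4 MB
    rgw obj stripe size = 41943040
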
> Now we observed a more serious problem.
> A lot of OSDs across different hosts started getting killed by the OOM killer.
> We saw that the memory usage of the OSDs was huge: ~10 GB per OSD.
> For comparison, we have a different replicated cluster with a lot more
> data where OSD memory usage is ~600 MB.
>
> At this point, we stopped the load test and tried to restart the
> individual OSDs.
> Even without load, each OSD's memory usage grows to ~11 GB.
>
> We ran the tcmalloc heap profiler against an OSD. Here is the graph
> generated by google-pprof.
> http://s33.postimg.org/5w48sr3an/mygif.gif
>
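For others wanting to reproduce this, the profile can be captured via
the OSD's built-in tcmalloc hooks and rendered with pprof; a sketch
(the osd id and file paths are illustrative; the actual dump lands in
the OSD's log directory):

    # start the tcmalloc heap profiler inside a running OSD
    ceph tell osd.12 heap start_profiler
    # write a heap dump to the osd log directory
    ceph tell osd.12 heap dump
    # render the dump against the osd binary
    google-pprof --gif /usr/bin/ceph-osd \
        /var/log/ceph/osd.12.profile.0001.heap > heap.gif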
>
> The graph seems to indicate that most of the memory is being allocated
> by PGLog::read_log
> Is this expected behaviour? Is there some setting that allows us to
> control this?
>
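The amount of memory pinned by pg logs is bounded by the pg log length
options, so those are worth checking; shown here with their upstream
hammer defaults (values for illustration, not a tuning recommendation):

    [osd]
    # logs are trimmed toward this length when the pg is clean
    osd min pg log entries = 3000
    # hard upper bound on log entries kept per pg
    osd max pg log entries = 10000
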
> Please let us know further steps we can take to fix the problem.
>
> Thanks
> Sharath



-- 
Regards
Kefu Chai