Re: OOM on OSDS with erasure coding

Hi Kefu,

I haven't tried the latest hammer.
Are there any pg_log-related fixes that have been applied there?

After a little more digging, our current suspicion is that the pg_log
is growing in proportion to the number of object chunks on each OSD.
Since we have k=10, m=5 (15 chunks per rados object), the memory usage
has become quite high.
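
For reference, here is the rough back-of-envelope estimate behind that
suspicion (the per-entry size and per-PG log length below are assumptions
for illustration, not measured values, and it treats every PG as a
15-shard EC PG, which over-counts the replicated pools):

    # Rough estimate of in-memory pg_log size per OSD (assumed constants).
    pgs_total       = 17024   # PGs in the cluster (from the status below)
    shards_per_pg   = 15      # k=10 + m=5 shards in the EC pool
    osds            = 192
    entries_per_pg  = 10000   # assumed pg_log length cap per PG
    bytes_per_entry = 500     # assumed in-memory size of one log entry

    pg_shards_per_osd = pgs_total * shards_per_pg / osds              # ~1330
    est_gb = pg_shards_per_osd * entries_per_pg * bytes_per_entry / 2**30
    print("~%.1f GB of pg_log per OSD" % est_gb)                      # ~6.2 GB

Even with these rough numbers we land in the same ballpark as the ~10 GB
per OSD we are actually seeing.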

Could you tell us a little about the implementation/lifecycle of the PGLog?
When are log entries trimmed? Are there any settings to control how much of
the log is kept in memory?
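
From reading the config reference, osd_min_pg_log_entries and
osd_max_pg_log_entries look like the relevant knobs, but I'm not sure I'm
reading them right. Assuming the admin-socket syntax below is correct, this
is roughly how we'd check what our OSDs are currently using (run on the host
where the OSD lives):

    # Read the current pg_log length limits from a running OSD's admin socket.
    # The option names are my assumption of what controls trimming -- please
    # correct me if these are not the right ones.
    import json
    import subprocess

    def osd_config_get(osd_id, option):
        out = subprocess.check_output(
            ["ceph", "daemon", "osd.%d" % osd_id, "config", "get", option])
        return json.loads(out.decode())[option]

    for opt in ("osd_min_pg_log_entries", "osd_max_pg_log_entries"):
        print("%s = %s" % (opt, osd_config_get(0, opt)))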

Thanks
Sharath

On Fri, Jun 3, 2016 at 10:34 PM, kefu chai <tchaikov@xxxxxxxxx> wrote:
> On Thu, Jun 2, 2016 at 2:23 PM, Sharath Gururaj <sharath.g@xxxxxxxxxxxx> wrote:
>> Hi All,
>>
>> We are testing an erasure-coded ceph cluster fronted by the rados gateway.
>> Recently many OSDs have been going down due to out-of-memory kills.
>> Here are the details.
>>
>> Description of the cluster:
>> ====================
>> ceph version 0.94.2 (hammer)
>
> Sharath, have you tried the latest hammer (v0.94.7)? Does it also have
> this issue?
>
>> 32 hosts, 6 disks (OSDs) per host, so 32*6 = 192 OSDs
>> 17024 pgs, 15 pools, 107 TB data, 57616 kobjects
>> 167 TB used, 508 TB / 675 TB available
>> erasure coding reed-solomon-van with k=10, m=5
>> We are using rgw as the client. Only the .rgw.buckets pool is erasure coded;
>> the rest of the rgw metadata/index pools are replicated with size=3.
>>
>> The Problem
>> ==========
>> We ran a load test against this cluster. The load test simply writes
>> 4.5 MB objects through a locust test cluster.
>> We observed very low throughput, with saturation on disk IOPS.
>> We reasoned that this is because the rgw stripe width is 4 MB,
>> which results in the OSDs splitting each stripe into 4 MB / k = 400 KB chunks,
>> which leads to random I/O behaviour.
>>
>> To mitigate this, we changed the rgw stripe width to 40 MB (so that, after
>> chunking, the chunk size becomes 40 MB / k = 4 MB) and modified the load
>> test to upload 40 MB objects.
>>
>> Now we observed a more serious problem.
>> A lot of OSDs across different hosts started getting killed by the OOM killer.
>> We saw that the memory usage of the OSDs was huge: ~10 GB per OSD.
>> For comparison, we have a different replicated cluster with a lot more
>> data where OSD memory usage is ~600 MB.
>>
>> At this point, we stopped the load test and tried to restart the
>> individual OSDs.
>> Even without load, the OSD memory usage grows to ~11 GB.
>>
>> We ran the tcmalloc heap profiler against an OSD. Here is the graph
>> generated by google-pprof.
>> http://s33.postimg.org/5w48sr3an/mygif.gif
>>
>>
>> The graph seems to indicate that most of the memory is being allocated
>> by PGLog::readLog. Is this expected behaviour? Is there some setting
>> that allows us to control this?
>>
>> Please let us know further steps we can take to fix the problem.
>>
>> Thanks
>> Sharath
>
>
>
> --
> Regards
> Kefu Chai


