On Sat, 4 Jun 2016, Sharath Gururaj wrote:
> Thanks Samuel,
>
> But how did you arrive at the number of 1.3k PG shards per OSD?
> We have ~17k PGs and 192 OSDs, giving ~88 PGs per OSD.

Each logical PG has k + m shards, so 17000 * 15 / 192 = 1328.

sage

>
> On Sat, Jun 4, 2016 at 1:24 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> > Oh, actually, I think the problem is simply that you have 1.3k PG
> > shards per OSD. We suggest more like 200. Indeed, the main user of
> > memory for a particular PG is the pg log, so it makes sense that that
> > is where most of the memory would be allocated. You can probably live
> > with fewer entries per PG: try adjusting osd_min_pg_log_entries and
> > osd_max_pg_log_entries (defaults are 3000 and 10000) down by a factor
> > of 10.
> > -Sam
> >
> > On Fri, Jun 3, 2016 at 12:08 PM, Sharath Gururaj <sharath.g@xxxxxxxxxxxx> wrote:
> >> Hi Kefu,
> >>
> >> I haven't tried the latest hammer.
> >> Are there any pg_log-related fixes that have been applied?
> >>
> >> After a little more digging, our current suspicion is that the pg_log
> >> is growing in proportion to the number of file chunks on each OSD.
> >> Since we have k=10, m=5 (15 chunks in total for each RADOS object), the
> >> memory usage has become quite high.
> >>
> >> Could you tell us a little about the implementation/lifecycle of the PGLog?
> >> When is it trimmed? Are there any settings to control how much of
> >> it is kept in memory?
> >>
> >> Thanks
> >> Sharath
> >>
> >> On Fri, Jun 3, 2016 at 10:34 PM, kefu chai <tchaikov@xxxxxxxxx> wrote:
> >>> On Thu, Jun 2, 2016 at 2:23 PM, Sharath Gururaj <sharath.g@xxxxxxxxxxxx> wrote:
> >>>> Hi All,
> >>>>
> >>>> We are testing an erasure-coded Ceph cluster fronted by RADOS Gateway.
> >>>> Recently, many OSDs have been going down due to out-of-memory kills.
> >>>> Here are the details.
> >>>>
> >>>> Description of the cluster:
> >>>> ====================
> >>>> ceph version 0.94.2 (hammer)
> >>>
> >>> Sharath, have you tried the latest hammer (v0.94.7)? Does it also have
> >>> this issue?
> >>>
> >>>> 32 hosts, 6 disks (OSDs) per host, so 32*6 = 192 OSDs
> >>>> 17024 pgs, 15 pools, 107 TB data, 57616 kobjects
> >>>> 167 TB used, 508 TB / 675 TB available
> >>>> erasure coding reed-solomon-van with k=10, m=5
> >>>> We are using rgw as the client. Only the .rgw.buckets pool is erasure coded;
> >>>> the rest of the rgw metadata/index pools are replicated with size=3.
> >>>>
> >>>> The Problem
> >>>> ==========
> >>>> We ran a load test against this cluster. The load test simply writes
> >>>> 4.5 MB objects through a Locust test cluster.
> >>>> We observed very low throughput with saturation on disk IOPS.
> >>>> We reasoned that this is because the rgw stripe width is 4 MB,
> >>>> which results in the OSDs splitting each stripe into 4 MB / k = 400 KB chunks,
> >>>> which leads to random-I/O behaviour.
> >>>>
> >>>> To mitigate this, we changed the rgw stripe width to 40 MB (so that, after
> >>>> chunking, the chunk size becomes 40 MB / k = 4 MB) and we modified the load
> >>>> test to upload 40 MB objects.
> >>>>
> >>>> Now we observed a more serious problem.
> >>>> A lot of OSDs across different hosts started getting killed by the OOM killer.
> >>>> We saw that the memory usage of the OSDs was huge: ~10 GB per OSD.
> >>>> For comparison, we have a different replicated cluster with a lot more
> >>>> data where OSD memory usage is ~600 MB.
> >>>>
> >>>> At this point, we stopped the load test and tried to restart the
> >>>> individual OSDs.
> >>>> Even without load, the OSD memory size grows to ~11 GB.
> >>>>
> >>>> We ran the tcmalloc heap profiler against an OSD. Here is the graph
> >>>> generated by google-pprof:
> >>>> http://s33.postimg.org/5w48sr3an/mygif.gif
> >>>>
> >>>> The graph seems to indicate that most of the memory is being allocated
> >>>> by PGLog::readLog.
> >>>> Is this expected behaviour? Is there some setting that allows us to
> >>>> control this?
> >>>>
> >>>> Please let us know further steps we can take to fix the problem.
> >>>>
> >>>> Thanks
> >>>> Sharath
> >>>
> >>> --
> >>> Regards
> >>> Kefu Chai
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
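
To make the arithmetic in Sage's reply concrete, here is a minimal Python sketch that reproduces the PG-shard count per OSD from the figures quoted in the thread (17024 PGs, 192 OSDs, k=10, m=5). Treating every PG as an erasure-coded PG with k + m shards is the same simplification used in the reply; only the .rgw.buckets pool is actually erasure coded, so the real count is somewhat lower, but still far above the ~200 shards per OSD that Sam suggests.

    # Rough estimate of PG shards per OSD, following Sage's arithmetic above.
    # All figures come from the thread; treating every PG as an EC PG with
    # k + m shards is the same simplification Sage used.

    total_pgs = 17024        # "17024 pgs, 15 pools" from the cluster description
    k, m = 10, 5             # reed-solomon-van EC profile
    num_osds = 192           # 32 hosts * 6 OSDs

    shards_per_pg = k + m                        # each logical EC PG stores k + m shards
    total_shards = total_pgs * shards_per_pg
    shards_per_osd = total_shards / num_osds

    print(f"total PG shards: {total_shards}")
    print(f"shards per OSD : {shards_per_osd:.0f}")   # ~1330, the "1.3k" Sam refers to

That is roughly 6 to 7 times the suggested ~200 PG shards per OSD.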
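
And here is a back-of-the-envelope look at Sam's suggestion to drop osd_min_pg_log_entries / osd_max_pg_log_entries by a factor of ten. The per-entry size below is an assumed illustrative number, not something stated in the thread or documented for hammer; the point is only that pg_log memory scales as (shards per OSD) x (log entries per PG) x (bytes per entry), so cutting the entry counts tenfold cuts the bound tenfold.

    # Rough upper bound on pg_log memory per OSD.  bytes_per_entry is an
    # ASSUMPTION chosen for illustration; real entries vary with object-name
    # length and operation type.

    shards_per_osd  = 1330        # from the shard-count sketch above
    bytes_per_entry = 2 * 1024    # assumed ~2 KiB per in-memory pg_log entry

    def pg_log_bound_gib(entries_per_pg: int) -> float:
        """Memory if every PG shard holds entries_per_pg log entries."""
        return shards_per_osd * entries_per_pg * bytes_per_entry / 2**30

    for label, entries in [("default min (3000)", 3000),
                           ("default max (10000)", 10000),
                           ("reduced min (300)", 300),
                           ("reduced max (1000)", 1000)]:
        print(f"{label:>20}: ~{pg_log_bound_gib(entries):.1f} GiB")

With this assumed entry size the default bounds land in the same ballpark as the ~10-11 GB Sharath observes, and the reduced values bring the bound down to a few GiB. If the change were applied, it would typically go into ceph.conf under [osd] or be injected at runtime with "ceph tell osd.* injectargs"; the usual trade-off of a shorter pg log is that recovery after an OSD outage is more likely to fall back to backfill instead of log-based recovery.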