On Sat, 4 Jun 2016, Sharath Gururaj wrote:
> Thanks Samuel,
>
> But how did you arrive at the number of 1.3k PG shards per OSD?
> We have ~17k PGs and 192 OSDs, giving ~88 PGs per OSD.

Each logical PG has k + m shards, so 17000 * 15 / 192 = 1328.

sage

>
> On Sat, Jun 4, 2016 at 1:24 AM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> > Oh, actually, I think the problem is simply that you have 1.3k PG
> > shards per OSD. We suggest more like 200. Indeed, the main user of
> > memory for a particular PG is the pg log, so it makes sense that that
> > is where most of the memory would be allocated. You can probably live
> > with fewer entries per PG: try adjusting osd_min_pg_log_entries and
> > osd_max_pg_log_entries (defaults are 3000 and 10000) down by a factor
> > of 10.
> > -Sam
> >
> > On Fri, Jun 3, 2016 at 12:08 PM, Sharath Gururaj <sharath.g@xxxxxxxxxxxx> wrote:
> >> Hi Kefu,
> >>
> >> I haven't tried the latest hammer.
> >> Are there any pg_log-related fixes that have been applied?
> >>
> >> After a little more digging, our current suspicion is that the pg_log
> >> is growing in proportion to the number of file chunks on each OSD.
> >> Since we have k=10, m=5 (15 chunks in total for each RADOS object), the
> >> memory usage has become quite high.
> >>
> >> Could you tell us a little about the implementation/lifecycle of the PGLog?
> >> When is it trimmed? Are there any settings to control how much of
> >> it is kept in memory?
> >>
> >> Thanks
> >> Sharath
> >>
> >> On Fri, Jun 3, 2016 at 10:34 PM, kefu chai <tchaikov@xxxxxxxxx> wrote:
> >>> On Thu, Jun 2, 2016 at 2:23 PM, Sharath Gururaj <sharath.g@xxxxxxxxxxxx> wrote:
> >>>> Hi All,
> >>>>
> >>>> We are testing an erasure-coded Ceph cluster fronted by RADOS Gateway.
> >>>> Recently, many OSDs have been going down due to out-of-memory kills.
> >>>> Here are the details.
> >>>>
> >>>> Description of the cluster:
> >>>> ====================
> >>>> ceph version 0.94.2 (hammer)
> >>>
> >>> Sharath, have you tried the latest hammer (v0.94.7)? Does it also have
> >>> this issue?
> >>>
> >>>> 32 hosts, 6 disks (OSDs) per host, so 32*6 = 192 OSDs
> >>>> 17024 pgs, 15 pools, 107 TB data, 57616 kobjects
> >>>> 167 TB used, 508 TB / 675 TB available
> >>>> erasure coding reed-solomon-van with k=10, m=5
> >>>> We are using rgw as the client. Only the .rgw.buckets pool is erasure coded;
> >>>> the rest of the rgw metadata/index pools are replicated with size=3.
> >>>>
> >>>> The Problem
> >>>> ==========
> >>>> We ran a load test against this cluster. The load test simply writes
> >>>> 4.5 MB objects through a Locust test cluster.
> >>>> We observed very low throughput with saturation on disk IOPS.
> >>>> We reasoned that this is because the rgw stripe width is 4 MB,
> >>>> which results in the OSDs splitting each stripe into 4 MB / k = 400 KB chunks,
> >>>> which leads to random-I/O behaviour.
> >>>>
> >>>> To mitigate this, we changed the rgw stripe width to 40 MB (so that, after
> >>>> chunking, the chunk size becomes 40 MB / k = 4 MB) and we modified the load
> >>>> test to upload 40 MB objects.
> >>>>
> >>>> Now we observed a more serious problem.
> >>>> A lot of OSDs across different hosts started getting killed by the OOM killer.
> >>>> We saw that the memory usage of the OSDs was huge: ~10 GB per OSD.
> >>>> For comparison, we have a different replicated cluster with a lot more
> >>>> data where OSD memory usage is ~600 MB.
> >>>>
> >>>> At this point, we stopped the load test and tried to restart the
> >>>> individual OSDs.
> >>>> Even without load, the OSD memory size grows to ~11 GB.
> >>>>
> >>>> We ran the tcmalloc heap profiler against an OSD. Here is the graph
> >>>> generated by google-pprof:
> >>>> http://s33.postimg.org/5w48sr3an/mygif.gif
> >>>>
> >>>> The graph seems to indicate that most of the memory is being allocated
> >>>> by PGLog::readLog.
> >>>> Is this expected behaviour? Is there some setting that allows us to
> >>>> control this?
> >>>>
> >>>> Please let us know further steps we can take to fix the problem.
> >>>>
> >>>> Thanks
> >>>> Sharath
> >>>
> >>> --
> >>> Regards
> >>> Kefu Chai
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
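
To make the arithmetic in Sage's reply concrete, here is a minimal Python sketch that reproduces the PG-shard count per OSD from the figures quoted in the thread (17024 PGs, 192 OSDs, k=10, m=5). Treating every PG as an erasure-coded PG with k + m shards is the same simplification used in the reply; only the .rgw.buckets pool is actually erasure coded, so the real count is somewhat lower, but still far above the ~200 shards per OSD that Sam suggests.

    # Rough estimate of PG shards per OSD, following Sage's arithmetic above.
    # All figures come from the thread; treating every PG as an EC PG with
    # k + m shards is the same simplification Sage used.

    total_pgs = 17024        # "17024 pgs, 15 pools" from the cluster description
    k, m = 10, 5             # reed-solomon-van EC profile
    num_osds = 192           # 32 hosts * 6 OSDs

    shards_per_pg = k + m                        # each logical EC PG stores k + m shards
    total_shards = total_pgs * shards_per_pg
    shards_per_osd = total_shards / num_osds

    print(f"total PG shards: {total_shards}")
    print(f"shards per OSD : {shards_per_osd:.0f}")   # ~1330, the "1.3k" Sam refers to

That is roughly 6 to 7 times the suggested ~200 PG shards per OSD.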
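
And here is a back-of-the-envelope look at Sam's suggestion to drop osd_min_pg_log_entries / osd_max_pg_log_entries by a factor of ten. The per-entry size below is an assumed illustrative number, not something stated in the thread or documented for hammer; the point is only that pg_log memory scales as (shards per OSD) x (log entries per PG) x (bytes per entry), so cutting the entry counts tenfold cuts the bound tenfold.

    # Rough upper bound on pg_log memory per OSD.  bytes_per_entry is an
    # ASSUMPTION chosen for illustration; real entries vary with object-name
    # length and operation type.

    shards_per_osd  = 1330        # from the shard-count sketch above
    bytes_per_entry = 2 * 1024    # assumed ~2 KiB per in-memory pg_log entry

    def pg_log_bound_gib(entries_per_pg: int) -> float:
        """Memory if every PG shard holds entries_per_pg log entries."""
        return shards_per_osd * entries_per_pg * bytes_per_entry / 2**30

    for label, entries in [("default min (3000)", 3000),
                           ("default max (10000)", 10000),
                           ("reduced min (300)", 300),
                           ("reduced max (1000)", 1000)]:
        print(f"{label:>20}: ~{pg_log_bound_gib(entries):.1f} GiB")

With this assumed entry size the default bounds land in the same ballpark as the ~10-11 GB Sharath observes, and the reduced values bring the bound down to a few GiB. If the change were applied, it would typically go into ceph.conf under [osd] or be injected at runtime with "ceph tell osd.* injectargs"; the usual trade-off of a shorter pg log is that recovery after an OSD outage is more likely to fall back to backfill instead of log-based recovery.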