On Tue, 2011-01-04 at 14:28 -0800, Gregory Farnum wrote:
> On Tue, Jan 4, 2011 at 1:58 PM, John Leach <john@xxxxxxxxxxxxxxx> wrote:
> > Hi,
> >
> > I've got a 3 node test cluster (3 mons, 3 osds) with about 24,000,000
> > very small objects across 2400 pools (written directly with librados,
> > this isn't a ceph filesystem).
> >
> > The cosd processes have steadily grown in ram size and have finally
> > exhausted ram and are getting killed by the oom killer (the nodes have
> > 6gig RAM and no swap).
> >
> > When I start them back up they just very quickly increase in ram size
> > again and get killed.
> >
> > Is this expected?
> No, it's definitely not. :/

excellent news (much better than it being expected!)

>
> > Do the osds require a certain amount of resident
> > memory relative to the data size (or perhaps number of objects)?
> Well, there's a small amount of memory overhead per-PG and per-pool,
> but the data size and number of objects shouldn't impact it. And I
> presume you haven't been changing your pgnum as you go?

I haven't touched the pg_nums on this cluster that I recall (it's been up
a couple of weeks but has nearly exclusively been used for writing this
test data).

>
> So, some questions:
> 1) How far through startup do your OSDs get before crashing? Does
> peering complete (I'd expect no)? Can you show us the output of "ceph
> -w" during your attempted startup?

2011-01-05 00:17:58.532524 mon e1: 3 mons at {0=10.135.211.78:6789/0,1=10.61.136.222:6789/0,2=10.202.105.222:6789/0}
2011-01-05 00:22:53.325264 osd e10659: 3 osds: 3 up, 3 in
2011-01-05 00:22:53.383272 pg v151295: 20936 pgs: 1 creating, 2 peering, 10352 crashed+peering, 3052 active+clean+degraded, 7053 degraded+peering, 476 crashed+degraded+peering; 24130 MB data, 266 GB used, 332 GB / 630 GB avail; 12489924/49420044 degraded (25.273%)
2011-01-05 00:22:53.422433 log 2011-01-05 00:22:53.325027 mon0 10.135.211.78:6789/0 4 : [INF] osd0 10.135.211.78:6801/31836 boot
2011-01-05 00:24:47.301186 pg v151296: 20936 pgs: 1 creating, 2 peering, 10352 crashed+peering, 3052 active+clean+degraded, 7053 degraded+peering, 476 crashed+degraded+peering; 24130 MB data, 266 GB used, 332 GB / 630 GB avail; 12489924/49420044 degraded (25.273%)

<cosd crashes here>

2011-01-05 00:25:52.422340 log 2011-01-05 00:25:52.189259 mon0 10.135.211.78:6789/0 5 : [INF] osd0 10.135.211.78:6801/31836 failed (by osd2 10.61.136.222:6800/915)
2011-01-05 00:25:57.265635 log 2011-01-05 00:25:57.121870 mon0 10.135.211.78:6789/0 6 : [INF] osd0 10.135.211.78:6801/31836 failed (by osd2 10.61.136.222:6800/915)
2011-01-05 00:26:02.341805 osd e10660: 3 osds: 2 up, 3 in
2011-01-05 00:26:02.362526 log 2011-01-05 00:26:02.127627 mon0 10.135.211.78:6789/0 7 : [INF] osd0 10.135.211.78:6801/31836 failed (by osd2 10.61.136.222:6800/915)
2011-01-05 00:26:02.470942 pg v151297: 20936 pgs: 1 creating, 2 peering, 10352 crashed+peering, 3052 active+clean+degraded, 7053 degraded+peering, 476 crashed+degraded+peering; 24130 MB data, 266 GB used, 332 GB / 630 GB avail; 12489924/49420044 degraded (25.273%)
2011-01-05 00:26:12.578266 pg v151298: 20936 pgs: 1 creating, 2 peering, 3393 crashed+peering, 3052 active+clean+degraded, 7053 degraded+peering, 7435 crashed+degraded+peering; 24130 MB data, 266 GB used, 332 GB / 630 GB avail; 20728862/49420044 degraded (41.944%)

> 2) Assuming you've built them with tcmalloc, can you enable memory
> profiling before you try and start it up, and post the results
> somewhere?
> (http://ceph.newdream.net/wiki/Memory_Profiling will get
> you started)

done. attached pprof output from last heap profile before cosd was killed.

Watching the process carefully, I noticed that it doesn't grow in size
steadily. It slowly grows to around 2gig (say over a 10 minute period)
then suddenly inflates to 5.5gig in perhaps less than a minute.

> >
> > Can you offer any guidance on planning for ram usage?
> Our target is under a few hundred megabytes. In the past whenever
> we've seen usage higher than this during normal operation we've had
> serious memory leaks. 6GB is way past what the memory requirements
> should ever be, though of course the more RAM you have the more
> file/object data can be cached in-memory which can provide some nice
> boosts in read bandwidth.
>
> That said, we haven't been very careful about memory usage in our
> peering code and this may be the cause of your problems with starting
> up again. But it wouldn't explain why they ran out of memory to begin
> with.

I was investigating a crash of the osd at the time (definitely not an oom
kill, got a core dump) so that probably started it all off.

John.

p.s: thanks for the help with tcmalloc profiling on irc :)
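p.p.s: for anyone following along who hasn't used the heap profiler before,
this is roughly the shape of what tcmalloc does once profiling is switched on.
It's just a minimal standalone sketch of the gperftools API, not the actual
cosd hooks (the Memory_Profiling wiki page above covers enabling it in ceph);
the file prefix, allocation sizes and counts below are made up for illustration:

/* build with something like: gcc heapdemo.c -ltcmalloc -o heapdemo
 * (older gperftools installs ship the header as <google/heap-profiler.h>) */
#include <stdlib.h>
#include <gperftools/heap-profiler.h>

int main(void)
{
    int i;

    /* start writing profiles named /tmp/heapdemo.0001.heap, .0002.heap, ... */
    HeapProfilerStart("/tmp/heapdemo");

    /* stand-in workload: lots of small live allocations */
    for (i = 0; i < 100000; i++) {
        if (!malloc(64))
            return 1;
    }

    HeapProfilerDump("after allocations"); /* force a labelled dump */
    HeapProfilerStop();
    return 0;
}

The .heap files it drops are what pprof then renders as text, e.g.
"pprof --text ./heapdemo /tmp/heapdemo.0001.heap", which is the same kind of
output as in the attached file.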
Attachment:
osd.0.0017.heap.pprof.txt.gz
Description: GNU Zip compressed data