On Tue, 2011-01-04 at 14:28 -0800, Gregory Farnum wrote:
> On Tue, Jan 4, 2011 at 1:58 PM, John Leach <john@xxxxxxxxxxxxxxx> wrote:
> > Hi,
> >
> > I've got a 3 node test cluster (3 mons, 3 osds) with about 24,000,000
> > very small objects across 2400 pools (written directly with librados,
> > this isn't a ceph filesystem).
> >
> > The cosd processes have steadily grown in ram size and have finally
> > exhausted ram and are getting killed by the oom killer (the nodes have
> > 6gig RAM and no swap).
> >
> > When I start them back up they just very quickly increase in ram size
> > again and get killed.
> >
> > Is this expected?
> No, it's definitely not. :/

excellent news (much better than it being expected!)

>
> > Do the osds require a certain amount of resident
> > memory relative to the data size (or perhaps number of objects)?
> Well, there's a small amount of memory overhead per-PG and per-pool,
> but the data size and number of objects shouldn't impact it. And I
> presume you haven't been changing your pgnum as you go?

I haven't touched the pg_nums on this cluster that I recall (it's been up
a couple of weeks but has nearly exclusively been used for writing this
test data).

>
> So, some questions:
> 1) How far through startup do your OSDs get before crashing? Does
> peering complete (I'd expect no)? Can you show us the output of "ceph
> -w" during your attempted startup?

2011-01-05 00:17:58.532524 mon e1: 3 mons at {0=10.135.211.78:6789/0,1=10.61.136.222:6789/0,2=10.202.105.222:6789/0}
2011-01-05 00:22:53.325264 osd e10659: 3 osds: 3 up, 3 in
2011-01-05 00:22:53.383272 pg v151295: 20936 pgs: 1 creating, 2 peering, 10352 crashed+peering, 3052 active+clean+degraded, 7053 degraded+peering, 476 crashed+degraded+peering; 24130 MB data, 266 GB used, 332 GB / 630 GB avail; 12489924/49420044 degraded (25.273%)
2011-01-05 00:22:53.422433 log 2011-01-05 00:22:53.325027 mon0 10.135.211.78:6789/0 4 : [INF] osd0 10.135.211.78:6801/31836 boot
2011-01-05 00:24:47.301186 pg v151296: 20936 pgs: 1 creating, 2 peering, 10352 crashed+peering, 3052 active+clean+degraded, 7053 degraded+peering, 476 crashed+degraded+peering; 24130 MB data, 266 GB used, 332 GB / 630 GB avail; 12489924/49420044 degraded (25.273%)

<cosd crashes here>

2011-01-05 00:25:52.422340 log 2011-01-05 00:25:52.189259 mon0 10.135.211.78:6789/0 5 : [INF] osd0 10.135.211.78:6801/31836 failed (by osd2 10.61.136.222:6800/915)
2011-01-05 00:25:57.265635 log 2011-01-05 00:25:57.121870 mon0 10.135.211.78:6789/0 6 : [INF] osd0 10.135.211.78:6801/31836 failed (by osd2 10.61.136.222:6800/915)
2011-01-05 00:26:02.341805 osd e10660: 3 osds: 2 up, 3 in
2011-01-05 00:26:02.362526 log 2011-01-05 00:26:02.127627 mon0 10.135.211.78:6789/0 7 : [INF] osd0 10.135.211.78:6801/31836 failed (by osd2 10.61.136.222:6800/915)
2011-01-05 00:26:02.470942 pg v151297: 20936 pgs: 1 creating, 2 peering, 10352 crashed+peering, 3052 active+clean+degraded, 7053 degraded+peering, 476 crashed+degraded+peering; 24130 MB data, 266 GB used, 332 GB / 630 GB avail; 12489924/49420044 degraded (25.273%)
2011-01-05 00:26:12.578266 pg v151298: 20936 pgs: 1 creating, 2 peering, 3393 crashed+peering, 3052 active+clean+degraded, 7053 degraded+peering, 7435 crashed+degraded+peering; 24130 MB data, 266 GB used, 332 GB / 630 GB avail; 20728862/49420044 degraded (41.944%)

> 2) Assuming you've built them with tcmalloc, can you enable memory
> profiling before you try and start it up, and post the results
> somewhere?
> (http://ceph.newdream.net/wiki/Memory_Profiling will get
> you started)

done. attached pprof output from last heap profile before cosd was killed.

Watching the process carefully, I noticed that it doesn't grow in size
steadily. It slowly grows to around 2gig (say over a 10 minute period)
then suddenly inflates to 5.5gig in perhaps less than a minute.

> >
> > Can you offer any guidance on planning for ram usage?
> Our target is under a few hundred megabytes. In the past whenever
> we've seen usage higher than this during normal operation we've had
> serious memory leaks. 6GB is way past what the memory requirements
> should ever be, though of course the more RAM you have the more
> file/object data can be cached in-memory which can provide some nice
> boosts in read bandwidth.
>
> That said, we haven't been very careful about memory usage in our
> peering code and this may be the cause of your problems with starting
> up again. But it wouldn't explain why they ran out of memory to begin
> with.

I was investigating a crash of the osd at the time (definitely not an oom
kill, got a core dump) so that probably started it all off.

John.

p.s: thanks for the help with tcmalloc profiling on irc :)
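p.p.s: for anyone following along who hasn't used the heap profiler before,
this is roughly the shape of what tcmalloc does once profiling is switched on.
It's just a minimal standalone sketch of the gperftools API, not the actual
cosd hooks (the Memory_Profiling wiki page above covers enabling it in ceph);
the file prefix, allocation sizes and counts below are made up for illustration:

/* build with something like: gcc heapdemo.c -ltcmalloc -o heapdemo
 * (older gperftools installs ship the header as <google/heap-profiler.h>) */
#include <stdlib.h>
#include <gperftools/heap-profiler.h>

int main(void)
{
    int i;

    /* start writing profiles named /tmp/heapdemo.0001.heap, .0002.heap, ... */
    HeapProfilerStart("/tmp/heapdemo");

    /* stand-in workload: lots of small live allocations */
    for (i = 0; i < 100000; i++) {
        if (!malloc(64))
            return 1;
    }

    HeapProfilerDump("after allocations"); /* force a labelled dump */
    HeapProfilerStop();
    return 0;
}

The .heap files it drops are what pprof then renders as text, e.g.
"pprof --text ./heapdemo /tmp/heapdemo.0001.heap", which is the same kind of
output as in the attached file.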
Attachment:
osd.0.0017.heap.pprof.txt.gz
Description: GNU Zip compressed data