Re: OSD and MON memory usage

On Fri, Nov 16, 2012 at 9:24 AM, Cláudio Martins <ctpm@xxxxxxxxxx> wrote:
>
>  Hi,
>
>  We're testing ceph using a recent build from the 'next' branch (commit
> b40387d) and we've run into some interesting problems related to memory
> usage.
>
>  The setup consists of 64 OSDs (4 boxes, each with 16 disks, most of
> them 2TB, some 1.5TB, XFS filesystems, Debian Wheezy). After the
> initial mkcephfs, a 'ceph -s' reports 12480 pgs total.
>
>  For generating some load we used
>
> rados -p rbd bench 28000 write -t 25
>
> and left it running overnight.
>
>  After several hours most of the OSDs had eaten up around 1GB or more
> of memory each, which caused thrashing on the servers (12GB of RAM
> per box), and eventually the OOM killer was invoked, killing many OSDs
> and even the SSH daemons. This seems to have caused a domino effect,
> and in the morning only around 18 of the OSDs were still up.

That's not good! Sam thinks you may have hit a memory leak, and I
believe I heard him making noises about discovering it elsewhere
earlier today. I will let him talk about that.


>  After a hard reboot of the boxes that were unresponsive, we are now in
> a situation in which there is simply not enough memory for the cluster
> to recover. That is, after restarting the OSDs, in 2 to 3 minutes we
> have many of them using 1~1.5GB of RAM and the thrashing starts all over
> again, the OOM killer comes in and things go downhill again. Effectively
> the cluster is not able to recover no matter how many times we restart
> the daemons.
>
>  We're not using any non-default options in the OSD section of the
> config file. We checked that there is free space for logging on the
> system partitions.
>
>  While I know that 12GB per machine can hardly be called too much RAM,
> the question I put forward is: is it reasonable for an OSD to consume so
> much memory in normal usage, or even in recovery situations, when there
> are only around ~200 PGs per OSD and only around ~3TB of objects created
> by rados bench?

In normal usage, absolutely not. In recovery, for now, yes, we
consider that reasonable usage. We may be able to bring that down
further, but since that memory is good to have available for page
cache anyway, we haven't been focusing on it much lately.
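
If you want to see what an individual OSD's allocator is actually
holding while this happens, you can query the tcmalloc heap-stats hook
directly; the exact invocation below (and the osd id) is a sketch from
memory rather than something verified on your build:

  ceph tell osd.0 heap stats

That prints tcmalloc's breakdown of bytes in actual use versus bytes
sitting in the allocator's freelists, which helps distinguish real
growth inside the daemon from memory tcmalloc simply hasn't returned
to the OS.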

>  Is there a rule of thumb to estimate the amount of memory consumed as
> a function of PG count, object count and perhaps the number of PGs
> trying to recover in a given instant? One of my concerns here is also
> to understand if memory consumption during recovery is bounded and
> deterministic at all, or if we're simply hitting a severe memory leak
> in the OSDs.

It varies mostly with the number of PGs the daemon has to recover,
and above all on whether it is recovering at all. We generally
recommend 1GB per daemon for a well-configured cluster (PGs in the
50-200 per OSD range).
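
As a quick back-of-the-envelope check against your setup (my
arithmetic, using the numbers from this thread, not a measured
figure):

  16 OSDs/box x ~1 GB/daemon during recovery ~= 16 GB, versus 12 GB of RAM per box

so even without a leak those machines are going to be tight on memory
whenever a lot of PGs are recovering at once.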


>  As for the monitor daemon on this cluster (running on a dedicated
> machine), it is currently using 3.2GB of memory, and it got to that
> point again in a matter of minutes after being restarted. Would it be
> good if we tested with the changes from the wip-mon-leaks-fix branch?

Can you restart your monitor, then run "ceph heap start_profiler", and
then once the memory has gone way up run "ceph heap dump"? You should
find some files in the same directory as your monitor log files that
you can analyze using tcmalloc's tools (see
http://ceph.com/wiki/Memory_Profiling for some info), or send them our
way for analysis.
This will go easier if you also have Ceph's debug symbol packages installed. :)
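Roughly, the workflow looks like this; the dump file name and paths
are illustrative, the real one will sit next to your mon log as
described above:

  ceph heap start_profiler
  # ...wait until the monitor's memory has climbed...
  ceph heap dump
  # then analyze with tcmalloc's pprof (packaged as google-pprof on Debian):
  pprof --text /usr/bin/ceph-mon /var/log/ceph/mon.a.profile.0001.heap

The --text output ranks call sites by live allocation, which is
usually enough to tell a genuine leak apart from expected usage.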
-Greg

