Re: question on mon memory usage

Travis Rhoden <trhoden@xxxxxxxxx> · Mon, 25 Feb 2013 19:48:24 -0500

Right now everything is on a stock setup.   I believe that means no core file.

root@ceph2:~# ulimit -c
0

Doh.  I don't see anything in the ceph init script that would increase
this for the ceph-* processes.  Which is probably a good thing, of
course.
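
One caveat on the above: `ulimit -c` run from my login shell only reports that shell's limit, which is not necessarily what the running daemon inherited. A minimal sketch for reading the limit from the process itself, assuming a Linux /proc filesystem and that `pgrep` is installed; the function names are hypothetical and only for illustration:

```shell
#!/bin/sh
# Print the soft "Max core file size" value from a /proc/<pid>/limits-style file.
# Row layout: "Max core file size  <soft>  <hard>  <units>", so the soft limit
# is the fifth whitespace-separated field.
core_soft_limit() {
    awk '/^Max core file size/ { print $5 }' "$1"
}

# Report the soft core limit of every running ceph-mon on this host.
# Assumes pgrep is available and the process name is exactly "ceph-mon".
report_ceph_mon_core_limits() {
    for pid in $(pgrep -x ceph-mon); do
        echo "ceph-mon pid $pid: core soft limit = $(core_soft_limit "/proc/$pid/limits")"
    done
}
```

Running `report_ceph_mon_core_limits` as root on ceph2 should show whether mon.b itself is capped at 0, rather than just my shell.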

On Mon, Feb 25, 2013 at 7:40 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> On Mon, 25 Feb 2013, Travis Rhoden wrote:
>> Joao,
>>
>> Happy to help if I can.  Responses inline.
>>
>> On Mon, Feb 25, 2013 at 4:05 PM, Joao Eduardo Luis
>> <joao.luis@xxxxxxxxxxx> wrote:
>> > On 02/25/2013 07:59 PM, Travis Rhoden wrote:
>> >>
>> >> Hi folks,
>> >>
>> >> A question about memory usage by the Mon.  I have a cluster that is
>> >> being used exclusively for RBD (no CephFS/mds).  I have 5 mons, and
>> >> one is slowly but surely using a heck of a lot more memory than the
>> >> others:
>> >>
>> >> # for x in ceph{1..5}; do ssh $x 'ps aux | grep ceph-mon | grep -v grep'; done
>> >> root     31034  5.2  0.1 312116 75516 ?        Ssl  Feb14 881:51
>> >> /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c
>> >> /etc/ceph/ceph.conf
>> >> root     29361  4.8 53.9 22526128 22238080 ?   Ssl  Feb14 822:36
>> >> /usr/bin/ceph-mon -i b --pid-file /var/run/ceph/mon.b.pid -c
>> >> /tmp/ceph.conf.31144
>> >> root     28421  7.0  0.1 273608 88608 ?        Ssl  Feb20 516:48
>> >> /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c
>> >> /tmp/ceph.conf.10625
>> >> root     25876  4.8  0.1 240752 84048 ?        Ssl  Feb14 816:54
>> >> /usr/bin/ceph-mon -i d --pid-file /var/run/ceph/mon.d.pid -c
>> >> /tmp/ceph.conf.31537
>> >> root     24505  4.8  0.1 228720 79284 ?        Ssl  Feb14 818:14
>> >> /usr/bin/ceph-mon -i e --pid-file /var/run/ceph/mon.e.pid -c
>> >> /tmp/ceph.conf.31734
>> >>
>> >> As you can see, one is up over 20GB, while the others are < 100MB.
>> >>
>> >> Is this normal?  The box has plenty of RAM -- I'm wondering if this is
>> >> a memory leak, or if it's just slowly finding more things it can cache
>> >> and such.
>> >>
>> >
>> > Hi Travis,
>> >
>> > Which version are you running?
>> >
>> # ceph --version
>> ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5)
>>
>> That's the case all around: OSDs, mons, librbd clients, everything in
>> my cluster.
>>
>> > This is something that pops up on the list every now and then, and I've
>> > spent a considerable amount of time trying to track it down.
>> >
>> > My current suspicion lies on the in-memory pgmap growing, and growing, and
>> > growing... and it usually hits the leader the worst.  Can you please confirm
>> > that mon.b is indeed the leader?
>> I'm not 100% sure how to do that.  I'm guessing rank 0 from the
>> following output?
>>
>> # ceph quorum_status
>> { "election_epoch": 32,
>>   "quorum": [
>>         0,
>>         1,
>>         2,
>>         3,
>>         4],
>>   "monmap": { "epoch": 1,
>>       "fsid": "d5229b51-5321-48d2-bbb2-16062abb1992",
>>       "modified": "2013-01-21 17:58:14.389411",
>>       "created": "2013-01-21 17:58:14.389411",
>>       "mons": [
>>             { "rank": 0,
>>               "name": "a",
>>               "addr": "10.10.30.1:6789\/0"},
>>             { "rank": 1,
>>               "name": "b",
>>               "addr": "10.10.30.2:6789\/0"},
>>             { "rank": 2,
>>               "name": "c",
>>               "addr": "10.10.30.3:6789\/0"},
>>             { "rank": 3,
>>               "name": "d",
>>               "addr": "10.10.30.4:6789\/0"},
>>             { "rank": 4,
>>               "name": "e",
>>               "addr": "10.10.30.5:6789\/0"}]}}
>>
>> That would seem to imply that mon a is the leader.  mon b is
>> definitely the problem child at the moment.
>>
>> I did a quick check, and mon b has grown by ~ 400MB since my previous
>> email.  So we're looking at a little under 100MB/hr, perhaps.  Not
>> sure if that's consistent or not.  Will certainly check again in the
>> morning.
>
> Do you know if there is a core file ulimit set on that process?  If the
> core is configured to go somewhere, a kill -SEGV on it would generate a
> core that would help us figure out what the memory is consumed by.
>
> Thanks!
> sage
>