Joao,

Happy to help if I can. Responses inline.

On Mon, Feb 25, 2013 at 4:05 PM, Joao Eduardo Luis <joao.luis@xxxxxxxxxxx> wrote:
> On 02/25/2013 07:59 PM, Travis Rhoden wrote:
>>
>> Hi folks,
>>
>> A question about memory usage by the mon. I have a cluster that is
>> being used exclusively for RBD (no CephFS/MDS). I have 5 mons, and
>> one is slowly but surely using a heck of a lot more memory than the
>> others:
>>
>> # for x in ceph{1..5}; do ssh $x 'ps aux | grep ceph-mon | grep -v grep'; done
>> root  31034  5.2  0.1   312116    75516 ?  Ssl  Feb14  881:51 /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf
>> root  29361  4.8 53.9 22526128 22238080 ?  Ssl  Feb14  822:36 /usr/bin/ceph-mon -i b --pid-file /var/run/ceph/mon.b.pid -c /tmp/ceph.conf.31144
>> root  28421  7.0  0.1   273608    88608 ?  Ssl  Feb20  516:48 /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /tmp/ceph.conf.10625
>> root  25876  4.8  0.1   240752    84048 ?  Ssl  Feb14  816:54 /usr/bin/ceph-mon -i d --pid-file /var/run/ceph/mon.d.pid -c /tmp/ceph.conf.31537
>> root  24505  4.8  0.1   228720    79284 ?  Ssl  Feb14  818:14 /usr/bin/ceph-mon -i e --pid-file /var/run/ceph/mon.e.pid -c /tmp/ceph.conf.31734
>>
>> As you can see, one is up over 20GB (RSS), while the others are < 100MB.
>>
>> Is this normal? The box has plenty of RAM -- I'm wondering if this is
>> a memory leak, or if it's just slowly finding more things it can cache
>> and such.
>
> Hi Travis,
>
> Which version are you running?

# ceph --version
ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5)

That's the case all around: OSDs, mons, librbd clients, everything in my
cluster.

> This has been something that pops up on the list every now and then, and
> I've spent a considerable amount of time trying to track it down.
>
> My current suspicion is the in-memory pgmap growing, and growing, and
> growing... and it usually hits the leader the worst. Can you please
> confirm that mon.b is indeed the leader?

I'm not 100% sure how to do that. I'm guessing it's rank 0, from the
following output?

# ceph quorum_status
{ "election_epoch": 32,
  "quorum": [ 0, 1, 2, 3, 4],
  "monmap": { "epoch": 1,
      "fsid": "d5229b51-5321-48d2-bbb2-16062abb1992",
      "modified": "2013-01-21 17:58:14.389411",
      "created": "2013-01-21 17:58:14.389411",
      "mons": [
            { "rank": 0, "name": "a", "addr": "10.10.30.1:6789\/0"},
            { "rank": 1, "name": "b", "addr": "10.10.30.2:6789\/0"},
            { "rank": 2, "name": "c", "addr": "10.10.30.3:6789\/0"},
            { "rank": 3, "name": "d", "addr": "10.10.30.4:6789\/0"},
            { "rank": 4, "name": "e", "addr": "10.10.30.5:6789\/0"}]}}

That would seem to imply that mon.a is the leader, while mon.b is
definitely the problem child at the moment. (A quick stab at pulling
the leader out of that output is at the end of this mail -- please tell
me if its logic is wrong.)

I did a quick check, and mon.b has grown by ~400MB since my previous
email. So we're looking at a little under 100MB/hr, perhaps. Not sure
if that rate is consistent or not; I'll keep sampling it overnight with
the loop at the end of this mail and will certainly check again in the
morning.

> Also, a 'ceph -s' would be appreciated.

# ceph -s
   health HEALTH_OK
   monmap e1: 5 mons at {a=10.10.30.1:6789/0,b=10.10.30.2:6789/0,c=10.10.30.3:6789/0,d=10.10.30.4:6789/0,e=10.10.30.5:6789/0}, election epoch 32, quorum 0,1,2,3,4 a,b,c,d,e
   osdmap e466: 60 osds: 60 up, 60 in
    pgmap v643826: 23808 pgs: 23808 active+clean; 609 GB data, 1249 GB used, 107 TB / 109 TB avail; 0B/s rd, 4023B/s wr, 1op/s
   mdsmap e1: 0/0/1 up

> -Joao
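
P.S. Here's the quick stab I mentioned at picking the leader out of
'ceph quorum_status'. It assumes the leader is simply the quorum member
with the lowest rank -- if that assumption is wrong, so is the snippet:

ceph quorum_status | python -c '
import json, sys
qs = json.load(sys.stdin)
# Assumption: the quorum member with the lowest rank is the leader.
leader_rank = min(qs["quorum"])
for m in qs["monmap"]["mons"]:
    if m["rank"] == leader_rank:
        print("leader: mon.%s at %s" % (m["name"], m["addr"]))
'

On the output above, that prints "leader: mon.a at 10.10.30.1:6789/0".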
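P.P.S. And the rough loop I'm planning to leave running to see whether
the ~100MB/hr growth rate holds. Same assumptions as my earlier
one-liner: hosts named ceph1..ceph5, passwordless ssh, and exactly one
ceph-mon per host. Column 6 of 'ps aux' is RSS in KB:

#!/bin/bash
# Log a timestamped RSS sample for each mon every 10 minutes.
while true; do
    for x in ceph{1..5}; do
        # '[c]eph-mon' keeps grep from matching itself; $6 is RSS (KB).
        rss=$(ssh "$x" "ps aux | grep [c]eph-mon | awk '{print \$6}'")
        echo "$(date '+%Y-%m-%d %H:%M:%S') $x rss_kb=$rss"
    done
    sleep 600
done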