Joao,

Happy to help if I can. Responses inline.

On Mon, Feb 25, 2013 at 4:05 PM, Joao Eduardo Luis <joao.luis@xxxxxxxxxxx> wrote:
> On 02/25/2013 07:59 PM, Travis Rhoden wrote:
>>
>> Hi folks,
>>
>> A question about memory usage by the mon. I have a cluster that is
>> being used exclusively for RBD (no CephFS/MDS). I have 5 mons, and
>> one is slowly but surely using a heck of a lot more memory than the
>> others:
>>
>> # for x in ceph{1..5}; do ssh $x 'ps aux | grep ceph-mon | grep -v grep'; done
>> root  31034  5.2  0.1   312116    75516 ?  Ssl  Feb14  881:51 /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf
>> root  29361  4.8 53.9 22526128 22238080 ?  Ssl  Feb14  822:36 /usr/bin/ceph-mon -i b --pid-file /var/run/ceph/mon.b.pid -c /tmp/ceph.conf.31144
>> root  28421  7.0  0.1   273608    88608 ?  Ssl  Feb20  516:48 /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /tmp/ceph.conf.10625
>> root  25876  4.8  0.1   240752    84048 ?  Ssl  Feb14  816:54 /usr/bin/ceph-mon -i d --pid-file /var/run/ceph/mon.d.pid -c /tmp/ceph.conf.31537
>> root  24505  4.8  0.1   228720    79284 ?  Ssl  Feb14  818:14 /usr/bin/ceph-mon -i e --pid-file /var/run/ceph/mon.e.pid -c /tmp/ceph.conf.31734
>>
>> As you can see, one is up over 20GB (RSS), while the others are < 100MB.
>>
>> Is this normal? The box has plenty of RAM -- I'm wondering if this is
>> a memory leak, or if it's just slowly finding more things it can cache
>> and such.
>
> Hi Travis,
>
> Which version are you running?

# ceph --version
ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5)

That's the case all around: OSDs, mons, librbd clients, everything in my
cluster.

> This has been something that pops up on the list every now and then, and
> I've spent a considerable amount of time trying to track it down.
>
> My current suspicion is the in-memory pgmap growing, and growing, and
> growing... and it usually hits the leader the worst. Can you please
> confirm that mon.b is indeed the leader?

I'm not 100% sure how to do that. I'm guessing it's rank 0, from the
following output?

# ceph quorum_status
{ "election_epoch": 32,
  "quorum": [ 0, 1, 2, 3, 4],
  "monmap": { "epoch": 1,
      "fsid": "d5229b51-5321-48d2-bbb2-16062abb1992",
      "modified": "2013-01-21 17:58:14.389411",
      "created": "2013-01-21 17:58:14.389411",
      "mons": [
            { "rank": 0, "name": "a", "addr": "10.10.30.1:6789\/0"},
            { "rank": 1, "name": "b", "addr": "10.10.30.2:6789\/0"},
            { "rank": 2, "name": "c", "addr": "10.10.30.3:6789\/0"},
            { "rank": 3, "name": "d", "addr": "10.10.30.4:6789\/0"},
            { "rank": 4, "name": "e", "addr": "10.10.30.5:6789\/0"}]}}

That would seem to imply that mon.a is the leader, while mon.b is
definitely the problem child at the moment. (A quick stab at pulling
the leader out of that output is at the end of this mail -- please tell
me if its logic is wrong.)

I did a quick check, and mon.b has grown by ~400MB since my previous
email. So we're looking at a little under 100MB/hr, perhaps. Not sure
if that rate is consistent or not; I'll keep sampling it overnight with
the loop at the end of this mail and will certainly check again in the
morning.

> Also, a 'ceph -s' would be appreciated.

# ceph -s
   health HEALTH_OK
   monmap e1: 5 mons at {a=10.10.30.1:6789/0,b=10.10.30.2:6789/0,c=10.10.30.3:6789/0,d=10.10.30.4:6789/0,e=10.10.30.5:6789/0}, election epoch 32, quorum 0,1,2,3,4 a,b,c,d,e
   osdmap e466: 60 osds: 60 up, 60 in
    pgmap v643826: 23808 pgs: 23808 active+clean; 609 GB data, 1249 GB used, 107 TB / 109 TB avail; 0B/s rd, 4023B/s wr, 1op/s
   mdsmap e1: 0/0/1 up

> -Joao
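
P.S. Here's the quick stab I mentioned at picking the leader out of
'ceph quorum_status'. It assumes the leader is simply the quorum member
with the lowest rank -- if that assumption is wrong, so is the snippet:

ceph quorum_status | python -c '
import json, sys
qs = json.load(sys.stdin)
# Assumption: the quorum member with the lowest rank is the leader.
leader_rank = min(qs["quorum"])
for m in qs["monmap"]["mons"]:
    if m["rank"] == leader_rank:
        print("leader: mon.%s at %s" % (m["name"], m["addr"]))
'

On the output above, that prints "leader: mon.a at 10.10.30.1:6789/0".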
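P.P.S. And the rough loop I'm planning to leave running to see whether
the ~100MB/hr growth rate holds. Same assumptions as my earlier
one-liner: hosts named ceph1..ceph5, passwordless ssh, and exactly one
ceph-mon per host. Column 6 of 'ps aux' is RSS in KB:

#!/bin/bash
# Log a timestamped RSS sample for each mon every 10 minutes.
while true; do
    for x in ceph{1..5}; do
        # '[c]eph-mon' keeps grep from matching itself; $6 is RSS (KB).
        rss=$(ssh "$x" "ps aux | grep [c]eph-mon | awk '{print \$6}'")
        echo "$(date '+%Y-%m-%d %H:%M:%S') $x rss_kb=$rss"
    done
    sleep 600
done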