On Mon, 25 Feb 2013, Travis Rhoden wrote:
> Hi Sage,
>
> I gave that script a try.  Interestingly, I ended up with a core file
> from gdb itself.
>
> # file core
> core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style,
> from 'gdb --batch --pid 29361 -ex dump memory
> 29361-04bbb000-57bf58000.dump 0x04bbb00'
>
> So I think gdb crashed.  But before that happened, I did get 195M of
> output.  However, I was expecting a full 20+ GB.  Not sure if what I
> generated can be of use or not.  If so, I can tar and compress it all
> and place it somewhere useful if you like.  At its current size, I
> could host it in Dropbox for you to pull down.  At 20GB (if that had
> worked) I would need a place to scp it.

Argh.  Try this:

    http://ceph.com/qa/dump_proc_mem.txt

It takes one argument (the pid).  Pipe the output to a file, bzip2 it, and
post it somewhere.  Hopefully that'll do the trick...

sage
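The script at that URL isn't quoted in the thread, but the usual trick it
presumably automates is to walk the writable mappings listed in
/proc/<pid>/maps and copy them straight out of /proc/<pid>/mem, with no gdb
involved.  A minimal, untested sketch of that idea (run as root, redirect
stdout to a file):

    #!/bin/bash
    # Sketch only: dump every writable mapping of the target pid to stdout.
    pid=$1
    grep rw-p /proc/$pid/maps |
      sed -n 's/^\([0-9a-f]*\)-\([0-9a-f]*\) .*$/\1 \2/p' |
      while read start stop; do
          start=$((16#$start))   # hex -> decimal
          stop=$((16#$stop))
          # Mappings are page-aligned, so seek and copy in 4k blocks.
          # Some regions may refuse to be read, hence the 2>/dev/null.
          dd if=/proc/$pid/mem bs=4096 skip=$((start / 4096)) \
             count=$(((stop - start) / 4096)) 2>/dev/null
      done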
>
> - Travis
>
> On Mon, Feb 25, 2013 at 8:12 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> > On Mon, 25 Feb 2013, Travis Rhoden wrote:
> >> Right now everything is on a stock setup.  I believe that means no core file.
> >>
> >> root@ceph2:~# ulimit -c
> >> 0
> >>
> >> Doh.  I don't see anything in the ceph init script that would increase
> >> this for the ceph-* processes.  Which is probably a good thing, of
> >> course.
> >
> > Can you try something like this to grab an image of the process memory?
> >
> > #!/bin/bash
> > grep rw-p /proc/$1/maps | sed -n 's/^\([0-9a-f]*\)-\([0-9a-f]*\) .*$/\1 \2/p' | while read start stop; do gdb --batch --pid $1 -ex "dump memory $1-$start-$stop.dump 0x$start 0x$stop"; done
> >
> > (from http://stackoverflow.com/questions/12977179/reading-living-process-memory-without-interrupting-it-proc-kcore-is-an-option)
> >
> > Thanks!
> > sage
> >
> >>
> >> On Mon, Feb 25, 2013 at 7:40 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> >> > On Mon, 25 Feb 2013, Travis Rhoden wrote:
> >> >> Joao,
> >> >>
> >> >> Happy to help if I can.  Responses inline.
> >> >>
> >> >> On Mon, Feb 25, 2013 at 4:05 PM, Joao Eduardo Luis
> >> >> <joao.luis@xxxxxxxxxxx> wrote:
> >> >> > On 02/25/2013 07:59 PM, Travis Rhoden wrote:
> >> >> >> Hi folks,
> >> >> >>
> >> >> >> A question about memory usage by the mon.  I have a cluster that is
> >> >> >> being used exclusively for RBD (no CephFS/MDS).  I have 5 mons, and
> >> >> >> one is slowly but surely using a heck of a lot more memory than the
> >> >> >> others:
> >> >> >>
> >> >> >> # for x in ceph{1..5}; do ssh $x 'ps aux | grep ceph-mon | grep -v grep'; done
> >> >> >> root 31034  5.2  0.1 312116   75516    ? Ssl Feb14 881:51 /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf
> >> >> >> root 29361  4.8 53.9 22526128 22238080 ? Ssl Feb14 822:36 /usr/bin/ceph-mon -i b --pid-file /var/run/ceph/mon.b.pid -c /tmp/ceph.conf.31144
> >> >> >> root 28421  7.0  0.1 273608   88608    ? Ssl Feb20 516:48 /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /tmp/ceph.conf.10625
> >> >> >> root 25876  4.8  0.1 240752   84048    ? Ssl Feb14 816:54 /usr/bin/ceph-mon -i d --pid-file /var/run/ceph/mon.d.pid -c /tmp/ceph.conf.31537
> >> >> >> root 24505  4.8  0.1 228720   79284    ? Ssl Feb14 818:14 /usr/bin/ceph-mon -i e --pid-file /var/run/ceph/mon.e.pid -c /tmp/ceph.conf.31734
> >> >> >>
> >> >> >> As you can see, one is up over 20GB, while the others are < 100MB.
> >> >> >>
> >> >> >> Is this normal?  The box has plenty of RAM -- I'm wondering if this is
> >> >> >> a memory leak, or if it's just slowly finding more things it can cache
> >> >> >> and such.
> >> >> >
> >> >> > Hi Travis,
> >> >> >
> >> >> > Which version are you running?
> >> >>
> >> >> # ceph --version
> >> >> ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5)
> >> >>
> >> >> That's the case all around: OSDs, mons, librbd clients, everything in my cluster.
> >> >>
> >> >> > This has been something that pops up on the list every now and then, and I've
> >> >> > spent a considerable amount of time trying to track it down.
> >> >> >
> >> >> > My current suspicion falls on the in-memory pgmap growing, and growing, and
> >> >> > growing... and it usually hits the leader the worst.  Can you please confirm
> >> >> > that mon.b is indeed the leader?
> >> >>
> >> >> I'm not 100% sure how to do that.  I'm guessing rank 0 from the
> >> >> following output?
> >> >>
> >> >> # ceph quorum_status
> >> >> { "election_epoch": 32,
> >> >>   "quorum": [
> >> >>         0,
> >> >>         1,
> >> >>         2,
> >> >>         3,
> >> >>         4],
> >> >>   "monmap": { "epoch": 1,
> >> >>       "fsid": "d5229b51-5321-48d2-bbb2-16062abb1992",
> >> >>       "modified": "2013-01-21 17:58:14.389411",
> >> >>       "created": "2013-01-21 17:58:14.389411",
> >> >>       "mons": [
> >> >>             { "rank": 0,
> >> >>               "name": "a",
> >> >>               "addr": "10.10.30.1:6789\/0"},
> >> >>             { "rank": 1,
> >> >>               "name": "b",
> >> >>               "addr": "10.10.30.2:6789\/0"},
> >> >>             { "rank": 2,
> >> >>               "name": "c",
> >> >>               "addr": "10.10.30.3:6789\/0"},
> >> >>             { "rank": 3,
> >> >>               "name": "d",
> >> >>               "addr": "10.10.30.4:6789\/0"},
> >> >>             { "rank": 4,
> >> >>               "name": "e",
> >> >>               "addr": "10.10.30.5:6789\/0"}]}}
> >> >>
> >> >> That would seem to imply that mon.a is the leader.  mon.b is
> >> >> definitely the problem child at the moment.
> >> >>
> >> >> I did a quick check, and mon.b has grown by ~400MB since my previous
> >> >> email.  So we're looking at a little under 100MB/hr, perhaps.  Not
> >> >> sure if that's consistent or not.  Will certainly check again in the
> >> >> morning.
> >> >
> >> > Do you know if there is a core file ulimit set on that process?  If the
> >> > core is configured to go somewhere, a kill -SEGV on it would generate a
> >> > core that would help us figure out what the memory is consumed by.
> >> >
> >> > Thanks!
> >> > sage

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
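For reference, the core-dump route Sage asks about further up the thread only
works if the core ulimit is raised on the already-running ceph-mon before the
signal is sent; the stock limit of 0 is what ruled it out here.  A rough
sketch of doing that on a live process (assuming a util-linux recent enough
to ship prlimit, and using mon.b's pid 29361 from the ps output above; the
/var/crash path is only an example):

    # Raise the core file size limit on the running mon (needs root).
    prlimit --pid 29361 --core=unlimited:unlimited

    # Make sure cores land somewhere with room for a ~20GB file.
    mkdir -p /var/crash
    echo '/var/crash/core.%e.%p' > /proc/sys/kernel/core_pattern

    # Force a core dump; this kills the mon, so check that the remaining
    # monitors still form a quorum first.
    kill -SEGV 29361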