On Sun, May 6, 2012 at 5:53 PM, Vladimir Bashkirtsev
<vladimir@xxxxxxxxxxxxxxx> wrote:
> On 03/05/12 16:23, Greg Farnum wrote:
>>
>> On Wednesday, May 2, 2012 at 11:24 PM, Vladimir Bashkirtsev wrote:
>>>
>>> Greg,
>>>
>>> Apologies for multiple emails: my mail server is backed by ceph now and
>>> it struggled this morning (separate issue). So my mail server reported
>>> back to my mailer that sending of email failed when obviously it was not
>>> the case.
>>
>> Interesting -- I presume you're using the file system? That's not something
>> we've heard of anybody doing with Ceph before. :)
>>
>>>
>>> [root@gamma ~]# ceph -s
>>> 2012-05-03 15:46:55.640951 mds e2666: 1/1/1 up {0=1=up:active}, 1
>>> up:standby
>>> 2012-05-03 15:46:55.647106 osd e10728: 6 osds: 6 up, 6 in
>>> 2012-05-03 15:46:55.654052 log 2012-05-03 15:46:26.557084 mon.2
>>> 172.16.64.202:6789/0 2878 : [INF] mon.2 calling new monitor election
>>> 2012-05-03 15:46:55.654425 mon e7: 3 mons at
>>> {0=172.16.64.200:6789/0,1=172.16.64.201:6789/0,2=172.16.64.202:6789/0}
>>> 2012-05-03 15:46:56.961624 pg v1251669: 600 pgs: 2 creating, 598
>>> active+clean; 309 GB data, 963 GB used, 1098 GB / 2145 GB avail
>>>
>>> Loggin is on but nothing obvious in there: logs quite small. Number of
>>> ceph health logged (ceph monitored by nagios and so this record appears
>>> every 5 minutes), monitors periodically call for election (different
>>> periods between 1 to 15 minutes as it looks). That's it.
>>
>> Hrm. Generally speaking the monitors shouldn't call for elections unless
>> something changes (one of them crashes) or the leader monitor is slowing
>> down.
>> Can you increase the debug_mon to 20, the debug_ms to 1, and post one of
>> the logs somewhere? The "Live Debugging" section of
>> http://ceph.com/wiki/Debugging should give you what you need. :)
>
> Here's the logs and core dumps:
> http://www.bashkirtsev.com/logs-2012-05-07.tar.bz2
>
> Mons grown to 1.2GB and 2GB of memory.

Looking at the logs for mon.0, I see a lot of places where mon.0 takes
tens of seconds to write something to disk. If the disk is just about
full, that would make sense (many filesystems handle a nearly-full disk
poorly), and a monitor that gets stuck for that long could definitely
explain the memory growth: the monitors are buffering messages while
they wait on the disk. I suspect there isn't anything particularly wrong
here, unless I'm misunderstanding the story you're telling me. :)

Have you noticed this problem when the monitor's disk partition isn't
nearly full?
-Greg
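
For anyone who wants to test the nearly-full-disk theory on their own
monitor host, here is a rough Python sketch, not anything that ships
with Ceph. It reports how full the filesystem under a directory is and
times one small write plus fsync there, which is roughly the kind of
operation that looked slow in the mon.0 log. The function names, the
MON_DATA_DIR environment variable, and both thresholds are made up for
illustration; point it at whatever your "mon data" setting actually is.

```python
import os
import shutil
import tempfile
import time


def disk_fill_ratio(path):
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is used."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def timed_fsync_write(directory, size=4096):
    """Write `size` bytes to a temp file in `directory`, fsync it, return elapsed seconds."""
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        start = time.monotonic()
        os.write(fd, b"\0" * size)
        os.fsync(fd)  # force the data to disk, as a durable store has to
        return time.monotonic() - start
    finally:
        os.close(fd)
        os.unlink(tmp_path)


def check_mon_disk(directory, full_threshold=0.90, slow_threshold=1.0):
    """Collect both measurements; the thresholds are arbitrary illustrative values."""
    ratio = disk_fill_ratio(directory)
    latency = timed_fsync_write(directory)
    return {
        "fill_ratio": ratio,
        "fsync_seconds": latency,
        "nearly_full": ratio >= full_threshold,
        "slow_write": latency >= slow_threshold,
    }


if __name__ == "__main__":
    # MON_DATA_DIR is a hypothetical variable for this sketch, not a Ceph setting.
    mon_dir = os.environ.get("MON_DATA_DIR", ".")
    result = check_mon_disk(mon_dir)
    print("fill ratio:   %.1f%%" % (result["fill_ratio"] * 100))
    print("fsync write:  %.3f s" % result["fsync_seconds"])
    if result["nearly_full"] and result["slow_write"]:
        print("disk is nearly full and writes are slow")
```

A single sample proves little; running it a few times while the monitor
is busy, and again after freeing space on the partition, would give a
fairer comparison.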