ceph stability

Hello,

I have two issues with Ceph stability and am looking for help resolving them.
My setup is fairly simple: three Debian stable (32-bit) systems, each running
an osd, a mon and an mds.
The conf is the following:
--------------------
[global]
    auth cluster required = none
    auth service required = none
    auth client required = none

[osd]
    osd journal size = 1000
    filestore xattr use omap = true

[mon.a]
    host = ceph-node01
    mon addr = 192.168.7.11:6789

[mon.b]
    host = ceph-node02
    mon addr = 192.168.7.12:6789

[mon.c]
    host = ceph-node03
    mon addr = 192.168.7.13:6789

[mds.a]
    host = ceph-node01

[mds.b]
    host = ceph-node02

[mds.c]
    host = ceph-node03

[osd.0]
    host = ceph-node01

[osd.1]
    host = ceph-node02

[osd.2]
    host = ceph-node03
--------------------
The output of ceph -s is:
   health HEALTH_OK
   monmap e4: 3 mons at {a=192.168.7.11:6789/0,b=192.168.7.12:6789/0,c=192.168.7.13:6789/0}, election epoch 118, quorum 0,1,2 a,b,c
   osdmap e197: 3 osds: 3 up, 3 in
    pgmap v43305: 384 pgs: 384 active+clean; 72351 MB data, 144 GB used, 105 GB / 249 GB avail
   mdsmap e4439: 1/1/1 up {0=a=up:active}, 2 up:standby
--------------------

My first problem: I am getting spurious mon deaths, which usually look like this:

--- begin dump of recent events ---
     0> 2012-12-19 10:35:58.912119 b41eab70 -1 *** Caught signal (Aborted) **
 in thread b41eab70

 ceph version 0.55.1 (8e25c8d984f9258644389a18997ec6bdef8e056b)
 1: /usr/bin/ceph-mon() [0x8183a11]
 2: [0xb7714400]
 3: (gsignal()+0x47) [0xb7337577]
 4: (abort()+0x182) [0xb733a962]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x14f) [0xb755653f]
 6: (()+0xbd405) [0xb7554405]
 7: (()+0xbd442) [0xb7554442]
 8: (()+0xbd581) [0xb7554581]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80f) [0x824cabf]
 10: /usr/bin/ceph-mon() [0x80e3c1d]
 11: (MDSMonitor::tick()+0x1e3b) [0x811ea0b]
 12: (MDSMonitor::on_active()+0x1d) [0x81188dd]
 13: (PaxosService::_active()+0x212) [0x80e4b02]
 14: (Context::complete(int)+0x19) [0x80c4cf9]
 15: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x13f) [0x80d208f]
 16: (Monitor::recovered_leader(int)+0x3ac) [0x80ac5ac]
 17: (Paxos::handle_last(MMonPaxos*)+0xb02) [0x80e0572]
 18: (Paxos::dispatch(PaxosServiceMessage*)+0x2c4) [0x80e0e94]
 19: (Monitor::_ms_dispatch(Message*)+0x1181) [0x80c3b11]
 20: (Monitor::ms_dispatch(Message*)+0x31) [0x80d5021]
 21: (DispatchQueue::entry()+0x337) [0x82afa47]
 22: (DispatchQueue::DispatchThread::entry()+0x20) [0x823eec0]
 23: (Thread::_entry_func(void*)+0x11) [0x824be41]
 24: (()+0x57b0) [0xb75ef7b0]
 25: (clone()+0x5e) [0xb73d8cde]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent    100000
  max_new         1000
  log_file /var/log/ceph/ceph-mon.a.log
--- end dump of recent events ---

The binaries come from the ceph.com debian-testing repo.
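
If more verbose monitor logs would help with debugging this, I can add
something like the following to ceph.conf on the mon nodes and restart them
(just a sketch; the debug levels are guesses at what would be useful):
--------------------
[mon]
    # standard ceph debug options, bumped from the defaults shown in the dump above
    debug mon = 20
    debug paxos = 20
    debug ms = 1
--------------------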

My second problem: I have two systems which mount ceph. Whenever I mount
ceph on any additional system, the mount usually succeeds, but stat*
operations get stuck (a simple ls -al on the ceph-mounted directory hangs
in read() for ages). This stuck client also affects the two working
clients: they start to hang on stat* as well, even after the third client
is shut down, so usually an umount/mount or even a reboot of the existing
clients is needed to resolve the issue. A sketch of what the hang looks
like follows below.
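
To illustrate, this is roughly what it looks like on the extra client
(shown with a kernel-client mount as an example; the mount point is
illustrative, the monitor address is from the conf above):
--------------------
# example only: mount CephFS via the kernel client (cephx is disabled in my conf)
mount -t ceph 192.168.7.11:6789:/ /mnt/ceph

# a simple directory listing then hangs in read() on the mounted directory
ls -al /mnt/ceph
--------------------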


--
...WBR, Roman Hlynovskiy