I deployed Ceph with Chef, but sometimes monitors failed to join the cluster.

The setup steps: first, I deployed monitors on two hosts (lc001 and lc003) and succeeded. Then, about 30 minutes later, I added two more monitors (lc002 and lc004) to the cluster. I used the same ceph-cookbook, but got three different results:

Result #1: the cluster was deployed successfully.
Result #2: neither of the two new monitors could join the cluster.
Result #3: one of them could not join the cluster.

Mon logs for Result #2 and Result #3:

Result #2:

lc001 (mon leader):
2014-08-12 23:47:37.214102 7fe90eb527a0 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 2616
2014-08-12 23:47:38.023315 7fbb05d857a0 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 2838
2014-08-12 23:47:38.061127 7fbb05d857a0 1 mon.lc001@-1(probing) e0 preinit fsid 71fc3205-ce61-4e2e-b7f9-05fd59f8c6dd
2014-08-12 23:47:38.061803 7fbb05d857a0 0 mon.lc001@-1(probing) e0 my rank is now 0 (was -1)
2014-08-12 23:47:38.061837 7fbb05d857a0 1 mon.lc001@0(probing) e0 win_standalone_election
2014-08-12 23:47:38.064423 7fbb05d857a0 0 log [INF] : mon.lc001@0 won leader election with quorum 0

lc002:
2014-08-13 00:19:20.398076 7f70de3707a0 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 5931
2014-08-13 00:19:22.689726 7f9e203257a0 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 6516
2014-08-13 00:19:22.750096 7f9e203257a0 1 mon.lc002@-1(probing) e0 preinit fsid 71fc3205-ce61-4e2e-b7f9-05fd59f8c6dd
2014-08-13 00:19:22.751250 7f9e203257a0 0 mon.lc002@-1(probing) e0 my rank is now 1 (was -1)
2014-08-13 00:19:22.754362 7f9e1b9c3700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption

Result #3:
lc001 (mon leader):
2014-08-12 21:00:37.066616 7f9e9f6a17a0 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 2610
2014-08-12 21:00:37.922371 7fa9d44117a0 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 2832
2014-08-12 21:00:37.959359 7fa9d44117a0 1 mon.lc001@-1(probing) e0 preinit fsid 71fc3205-ce61-4e2e-b7f9-05fd59f8c6dd
2014-08-12 21:00:37.960038 7fa9d44117a0 0 mon.lc001@-1(probing) e0 my rank is now 0 (was -1)
2014-08-12 21:00:37.960073 7fa9d44117a0 1 mon.lc001@0(probing) e0 win_standalone_election
2014-08-12 21:00:37.962906 7fa9d44117a0 0 log [INF] : mon.lc001@0 won leader election with quorum 0

lc002:
2014-08-12 21:51:34.067996 7f52cef727a0 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 3017
2014-08-12 21:51:34.813955 7f112c0bb7a0 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 3239
2014-08-12 21:51:34.871537 7f112c0bb7a0 1 mon.lc002@-1(probing) e0 preinit fsid 71fc3205-ce61-4e2e-b7f9-05fd59f8c6dd
2014-08-12 21:51:34.872215 7f112c0bb7a0 0 mon.lc002@-1(probing) e0 my rank is now 1 (was -1)
2014-08-12 21:51:34.879046 7f1128c5c700 0 mon.lc002@1(probing) e2 my rank is now -1 (was 1)

lc004:
2014-08-12 22:07:49.232619 7f1cdeabf7a0 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 3041
2014-08-12 22:07:49.971136 7fc8f38097a0 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 3263
2014-08-12 22:07:50.030808 7fc8f38097a0 1 mon.lc004@-1(probing) e0 preinit fsid 71fc3205-ce61-4e2e-b7f9-05fd59f8c6dd
2014-08-12 22:07:50.031491 7fc8f38097a0 0 mon.lc004@-1(probing) e0 my rank is now 3 (was -1)
2014-08-12 22:07:50.036596 7fc8eefa8700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption

According to the logs, the failure was caused by cephx. But all nodes had the same admin secret file in place before the ceph package was installed, except for the first monitor host. If I set auth to none, the cluster could be deployed successfully.
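My understanding is that the verify_reply decrypt error happens during mon-to-mon authentication, so it would point at the mon. secret rather than the client.admin key. A check I could run (only a sketch; it assumes the default monitor data path /var/lib/ceph/mon/ceph-<hostname>/keyring and passwordless ssh between the hosts) to see whether the new monitors ended up with a different mon. key than the existing quorum:

for h in lc001 lc002 lc003 lc004; do
    echo "== $h =="
    # print the [mon.] secret each monitor uses for mon-to-mon authentication
    ssh $h "grep -A1 '\[mon\.\]' /var/lib/ceph/mon/ceph-$h/keyring"
done

If the keys printed for lc002/lc004 differ from lc001/lc003, that would match the decrypt failures above.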
Some information that may be helpful in finding the problem:

[root@lc002 ~]# cat /etc/ceph/ceph.conf
[global]
fsid = 71fc3205-ce61-4e2e-b7f9-05fd59f8c6dd
mon initial members =
mon host = 10.1.4.30:6789, 10.1.4.32:6789
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
ms tcp read timeout = 6000
osd pool default pg num = 333
osd pool default pgp num = 333
osd pool default size = 2
public network = 10.1.0.0/16
[osd]
osd journal size = 10000

[root@lc002 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.lc002.asok mon_status
{ "name": "lc002",
  "rank": 1,
  "state": "probing",
  "election_epoch": 0,
  "quorum": [],
  "outside_quorum": [
        "lc002"],
  "extra_probe_peers": [
        "10.1.4.30:6789\/0",
        "10.1.4.31:6789\/0",
        "10.1.4.32:6789\/0"],
  "sync_provider": [],
  "monmap": { "epoch": 0,
      "fsid": "71fc3205-ce61-4e2e-b7f9-05fd59f8c6dd",
      "modified": "0.000000",
      "created": "0.000000",
      "mons": [
            { "rank": 0,
              "name": "noname-a",
              "addr": "10.1.4.30:6789\/0"},
            { "rank": 1,
              "name": "lc002",
              "addr": "10.1.4.31:6789\/0"},
            { "rank": 2,
              "name": "noname-c",
              "addr": "10.1.4.32:6789\/0"}]}}

[root@lc002 ~]# ceph auth list
installed auth entries:

osd.0
        key: AQB4N+pTgN5KCxAADtqDkbaMvjMsMPNQ1IwsRQ==
        caps: [mon] allow profile osd
        caps: [osd] allow *
osd.1
        key: AQB+N+pTyGlRLBAAyy8FlV/CXOdCmCoq64dzwQ==
        caps: [mon] allow profile osd
        caps: [osd] allow *
osd.10
        key: AQB+PupTQLY9AxAAbbEYrvlpQlA7T5zwNxAqQg==
        caps: [mon] allow profile osd
        caps: [osd] allow *
osd.11
        key: AQCLPupTCM07EhAAljtA6ZZC/0W5J4M1oteCQQ==
        caps: [mon] allow profile osd
        caps: [osd] allow *
osd.2
        key: AQCFN+pT2OpPFRAAls3Af7M+cSG+Hg4Jl5KFXQ==
        caps: [mon] allow profile osd
        caps: [osd] allow *
osd.3
        key: AQCLN+pTCEH+JhAAs3r2HDZHlXe2lwzbYCDSkg==
        caps: [mon] allow profile osd
        caps: [osd] allow *
osd.4
        key: AQAAOOpTAHq2ARAA14mq+DUN8mJK46Xx9361Og==
        caps: [mon] allow profile osd
        caps: [osd] allow *
osd.5
        key: AQAHOOpT0PILCRAAMEaJw45tzIZHUGluRASbmA==
        caps: [mon] allow profile osd
        caps: [osd] allow *
osd.6
        key: AQANOOpTEOs2NxAAYOwaS+M/4kH5Jp0bNSFCbg==
        caps: [mon] allow profile osd
        caps: [osd] allow *
osd.7
        key: AQAUOOpTuJvaKxAALIOMKVDVDOfjgrKNc+CVfg==
        caps: [mon] allow profile osd
        caps: [osd] allow *
osd.8
        key: AQBtPupT4FQQHxAAWbueUsrAZkik4gQXH/sAsw==
        caps: [mon] allow profile osd
        caps: [osd] allow *
osd.9
        key: AQBzPupT6B6LMRAAzShzIYfV7dLihMGlICGG6A==
        caps: [mon] allow profile osd
        caps: [osd] allow *
client.admin
        key: AQAaN+pTsHY1JhAAtC7wWJ0Nvvizmhu7j/loNA==
        caps: [mds] allow
        caps: [mon] allow *
        caps: [osd] allow *
client.bootstrap-mds
        key: AQAbN+pTWCtKDxAA2vAgQuhd6gUR16BG/eH39A==
        caps: [mon] allow profile bootstrap-mds
client.bootstrap-osd
        key: AQAaN+pTMOj/NxAA01Wg3zznxXaPtLYYMKOBhQ==
        caps: [mon] allow profile bootstrap-osd

[root@lc002 ~]# cat /etc/ceph/ceph.client.admin.keyring
[client.admin]
        key = AQAaN+pTsHY1JhAAtC7wWJ0Nvvizmhu7j/loNA==

[root@lc002 ~]# ceph -s
    cluster 71fc3205-ce61-4e2e-b7f9-05fd59f8c6dd
     health HEALTH_WARN too few pgs per osd (16 < min 20); clock skew detected on mon.lc003
     monmap e2: 2 mons at {lc001=10.1.4.30:6789/0,lc003=10.1.4.32:6789/0}, election epoch 4, quorum 0,1 lc001,lc003
     osdmap e54: 12 osds: 12 up, 12 in
      pgmap v106: 192 pgs, 3 pools, 0 bytes data, 0 objects
            410 MB used, 452 GB / 452 GB avail
                 192 active+clean
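As a fallback I am considering, instead of re-running the cookbook on lc002, the manual add-a-monitor sequence from the Ceph docs: it pulls both the mon. secret and the monmap from the existing quorum, so there should be no key mismatch. This is only a sketch; lc002's address 10.1.4.31:6789 is taken from the monmap output above:

# run on lc002 while lc001/lc003 still have quorum; the half-created mon dir from
# the failed Chef run may need to be moved aside first
ceph auth get mon. -o /tmp/ceph.mon.keyring      # fetch the cluster's mon. secret
ceph mon getmap -o /tmp/monmap                   # fetch the current monmap
ceph-mon -i lc002 --mkfs --monmap /tmp/monmap --keyring /tmp/ceph.mon.keyring
ceph mon add lc002 10.1.4.31:6789                # register lc002 in the monmap
service ceph start mon.lc002                     # or: ceph-mon -i lc002 --public-addr 10.1.4.31:6789

Does anyone have an idea why the cookbook run sometimes produces the cephx decrypt errors above?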