Sometimes Monitors failed to join the cluster

I deployed Ceph with Chef, but sometimes the Monitors failed to join the cluster.
The setup steps:
First, I deployed Monitors on two hosts (lc001 and lc003), and this succeeded.
Then, about 30 minutes later, I added two more Monitors (lc002 and lc004) to the cluster.

I used the same ceph-cookbook, but got three different results:
Result #1: the cluster was deployed successfully.
Result #2: neither Monitor could join the cluster.
Result #3: one of the two Monitors could not join the cluster.

Mon logs for Result #2 and Result #3:
Result #2:
lc001:(mon leader)
2014-08-12 23:47:37.214102 7fe90eb527a0  0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 2616
2014-08-12 23:47:38.023315 7fbb05d857a0  0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 2838
2014-08-12 23:47:38.061127 7fbb05d857a0  1 mon.lc001@-1(probing) e0 preinit fsid 71fc3205-ce61-4e2e-b7f9-05fd59f8c6dd
2014-08-12 23:47:38.061803 7fbb05d857a0  0 mon.lc001@-1(probing) e0  my rank is now 0 (was -1)
2014-08-12 23:47:38.061837 7fbb05d857a0  1 mon.lc001@0(probing) e0 win_standalone_election
2014-08-12 23:47:38.064423 7fbb05d857a0  0 log [INF] : mon.lc001@0 won leader election with quorum 0
lc002:
2014-08-13 00:19:20.398076 7f70de3707a0  0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 5931
2014-08-13 00:19:22.689726 7f9e203257a0  0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 6516
2014-08-13 00:19:22.750096 7f9e203257a0  1 mon.lc002@-1(probing) e0 preinit fsid 71fc3205-ce61-4e2e-b7f9-05fd59f8c6dd
2014-08-13 00:19:22.751250 7f9e203257a0  0 mon.lc002@-1(probing) e0  my rank is now 1 (was -1)
2014-08-13 00:19:22.754362 7f9e1b9c3700  0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
Result #3:
lc001:(mon leader)
2014-08-12 21:00:37.066616 7f9e9f6a17a0  0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 2610
2014-08-12 21:00:37.922371 7fa9d44117a0  0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 2832
2014-08-12 21:00:37.959359 7fa9d44117a0  1 mon.lc001@-1(probing) e0 preinit fsid 71fc3205-ce61-4e2e-b7f9-05fd59f8c6dd
2014-08-12 21:00:37.960038 7fa9d44117a0  0 mon.lc001@-1(probing) e0  my rank is now 0 (was -1)
2014-08-12 21:00:37.960073 7fa9d44117a0  1 mon.lc001@0(probing) e0 win_standalone_election
2014-08-12 21:00:37.962906 7fa9d44117a0  0 log [INF] : mon.lc001@0 won leader election with quorum 0
lc002:
2014-08-12 21:51:34.067996 7f52cef727a0  0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 3017
2014-08-12 21:51:34.813955 7f112c0bb7a0  0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 3239
2014-08-12 21:51:34.871537 7f112c0bb7a0  1 mon.lc002@-1(probing) e0 preinit fsid 71fc3205-ce61-4e2e-b7f9-05fd59f8c6dd
2014-08-12 21:51:34.872215 7f112c0bb7a0  0 mon.lc002@-1(probing) e0  my rank is now 1 (was -1)
2014-08-12 21:51:34.879046 7f1128c5c700  0 mon.lc002@1(probing) e2  my rank is now -1 (was 1)
lc004:
2014-08-12 22:07:49.232619 7f1cdeabf7a0  0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 3041
2014-08-12 22:07:49.971136 7fc8f38097a0  0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 3263
2014-08-12 22:07:50.030808 7fc8f38097a0  1 mon.lc004@-1(probing) e0 preinit fsid 71fc3205-ce61-4e2e-b7f9-05fd59f8c6dd
2014-08-12 22:07:50.031491 7fc8f38097a0  0 mon.lc004@-1(probing) e0  my rank is now 3 (was -1)
2014-08-12 22:07:50.036596 7fc8eefa8700  0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption

According to the logs, the failure was caused by cephx. But all nodes had the same admin secret file in place before the ceph package was installed, except the first Monitor host.
If I set auth to none, the cluster could be deployed successfully.
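For what it's worth, a `verify_reply couldn't decrypt` error usually means the monitors do not all hold the same `mon.` secret (or their clocks are badly skewed), so one sanity check is to compare the monitor keyring files across hosts. A minimal sketch, assuming the keyrings from each host have been copied locally first (the `KEYRING_COPIES` paths and host names here are hypothetical, not from the cookbook):

```python
import hashlib

# Hypothetical local copies of each monitor's keyring, e.g. fetched with scp.
# On the hosts themselves the file lives under /var/lib/ceph/mon/.
KEYRING_COPIES = {
    "lc001": "keyrings/lc001.keyring",
    "lc002": "keyrings/lc002.keyring",
}

def keyring_digest(path):
    """Hash a keyring file so copies from different hosts can be compared."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def all_keyrings_match(paths):
    """True if every keyring copy has an identical digest."""
    digests = {keyring_digest(p) for p in paths.values()}
    return len(digests) == 1
```

If `all_keyrings_match` returns False for the monitor keyrings, that would explain why the new monitors can probe the leader but fail cephx verification.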
Some information that may help find the problem:
[root@lc002 ~]# cat /etc/ceph/ceph.conf
[global]
  fsid = 71fc3205-ce61-4e2e-b7f9-05fd59f8c6dd
  mon initial members =
  mon host = 10.1.4.30:6789, 10.1.4.32:6789
  auth client required = cephx
  auth cluster required = cephx
  auth service required = cephx
  ms tcp read timeout = 6000
  osd pool default pg num = 333
  osd pool default pgp num = 333
  osd pool default size = 2
  public network = 10.1.0.0/16

[osd]
  osd journal size = 10000


[root@lc002 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.lc002.asok mon_status
{ "name": "lc002",
  "rank": 1,
  "state": "probing",
  "election_epoch": 0,
  "quorum": [],
  "outside_quorum": [
        "lc002"],
  "extra_probe_peers": [
        "10.1.4.30:6789\/0",
        "10.1.4.31:6789\/0",
        "10.1.4.32:6789\/0"],
  "sync_provider": [],
  "monmap": { "epoch": 0,
      "fsid": "71fc3205-ce61-4e2e-b7f9-05fd59f8c6dd",
      "modified": "0.000000",
      "created": "0.000000",
      "mons": [
            { "rank": 0,
              "name": "noname-a",
              "addr": "10.1.4.30:6789\/0"},
            { "rank": 1,
              "name": "lc002",
              "addr": "10.1.4.31:6789\/0"},
            { "rank": 2,
              "name": "noname-c",
              "addr": "10.1.4.32:6789\/0"}]}}

[root@lc002 ~]# ceph auth list
installed auth entries:

osd.0
       key: AQB4N+pTgN5KCxAADtqDkbaMvjMsMPNQ1IwsRQ==
       caps: [mon] allow profile osd
       caps: [osd] allow *
osd.1
       key: AQB+N+pTyGlRLBAAyy8FlV/CXOdCmCoq64dzwQ==
       caps: [mon] allow profile osd
       caps: [osd] allow *
osd.10
       key: AQB+PupTQLY9AxAAbbEYrvlpQlA7T5zwNxAqQg==
       caps: [mon] allow profile osd
       caps: [osd] allow *
osd.11
       key: AQCLPupTCM07EhAAljtA6ZZC/0W5J4M1oteCQQ==
       caps: [mon] allow profile osd
       caps: [osd] allow *
osd.2
       key: AQCFN+pT2OpPFRAAls3Af7M+cSG+Hg4Jl5KFXQ==
       caps: [mon] allow profile osd
       caps: [osd] allow *
osd.3
       key: AQCLN+pTCEH+JhAAs3r2HDZHlXe2lwzbYCDSkg==
       caps: [mon] allow profile osd
       caps: [osd] allow *
osd.4
       key: AQAAOOpTAHq2ARAA14mq+DUN8mJK46Xx9361Og==
       caps: [mon] allow profile osd
       caps: [osd] allow *
osd.5
       key: AQAHOOpT0PILCRAAMEaJw45tzIZHUGluRASbmA==
       caps: [mon] allow profile osd
       caps: [osd] allow *
osd.6
       key: AQANOOpTEOs2NxAAYOwaS+M/4kH5Jp0bNSFCbg==
       caps: [mon] allow profile osd
       caps: [osd] allow *
osd.7
       key: AQAUOOpTuJvaKxAALIOMKVDVDOfjgrKNc+CVfg==
       caps: [mon] allow profile osd
       caps: [osd] allow *
osd.8
       key: AQBtPupT4FQQHxAAWbueUsrAZkik4gQXH/sAsw==
       caps: [mon] allow profile osd
       caps: [osd] allow *
osd.9
       key: AQBzPupT6B6LMRAAzShzIYfV7dLihMGlICGG6A==
       caps: [mon] allow profile osd
       caps: [osd] allow *
client.admin
       key: AQAaN+pTsHY1JhAAtC7wWJ0Nvvizmhu7j/loNA==
       caps: [mds] allow
       caps: [mon] allow *
       caps: [osd] allow *
client.bootstrap-mds
       key: AQAbN+pTWCtKDxAA2vAgQuhd6gUR16BG/eH39A==
       caps: [mon] allow profile bootstrap-mds
client.bootstrap-osd
       key: AQAaN+pTMOj/NxAA01Wg3zznxXaPtLYYMKOBhQ==
       caps: [mon] allow profile bootstrap-osd

[root@lc002 ~]# cat /etc/ceph/ceph.client.admin.keyring
[client.admin]
  key = AQAaN+pTsHY1JhAAtC7wWJ0Nvvizmhu7j/loNA==

[root@lc002 ~]# ceph -s
    cluster 71fc3205-ce61-4e2e-b7f9-05fd59f8c6dd
     health HEALTH_WARN too few pgs per osd (16 < min 20); clock skew detected on mon.lc003
     monmap e2: 2 mons at {lc001=10.1.4.30:6789/0,lc003=10.1.4.32:6789/0}, election epoch 4, quorum 0,1 lc001,lc003
     osdmap e54: 12 osds: 12 up, 12 in
      pgmap v106: 192 pgs, 3 pools, 0 bytes data, 0 objects
            410 MB used, 452 GB / 452 GB avail
                 192 active+clean
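As an aside, the `too few pgs per osd (16 < min 20)` warning in that output is, as far as I understand it, just the total PG count divided by the OSD count compared against a minimum of 20. A quick check of the arithmetic from the `ceph -s` numbers above:

```python
# Figures taken from the `ceph -s` output above.
total_pgs = 192   # pgmap v106: 192 pgs
num_osds = 12     # osdmap e54: 12 osds

pgs_per_osd = total_pgs // num_osds
print(pgs_per_osd)  # 16, below the minimum of 20, hence the HEALTH_WARN
```

So the warning is unrelated to the cephx problem; it would go away with more PGs per pool (or more pools), which is a separate tuning question.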
