ceph rdma + IB network error


Hi all: 

 

By following the instructions:

(https://community.mellanox.com/docs/DOC-2721)

(https://community.mellanox.com/docs/DOC-2693)

(http://hwchiu.com/2017-05-03-ceph-with-rdma.html)

 

I'm trying to configure Ceph with the RDMA feature in the following environment:

 

CentOS Linux release 7.2.1511 (Core)

MLNX_OFED_LINUX-4.4-1.0.0.0:

Mellanox Technologies MT27500 Family [ConnectX-3]

 

rping works between all nodes, and I added these lines to ceph.conf to enable RDMA:

 

public_network = 10.10.121.0/24

cluster_network = 10.10.121.0/24

ms_type = async+rdma

ms_async_rdma_device_name = mlx4_0

ms_async_rdma_port_num = 2
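
The settings above can be sanity-checked before deploying. What follows is only a minimal Python sketch: it assumes the lines live in the [global] section of ceph.conf (the post does not say which section), and check_rdma_conf is a hypothetical helper, not part of Ceph or ceph-deploy. It checks only that the three RDMA options are present and that the two networks parse as CIDR ranges.

```python
import configparser
import ipaddress

# The [global] header is an assumption; the post shows only the option lines.
CONF_TEXT = """
[global]
public_network = 10.10.121.0/24
cluster_network = 10.10.121.0/24
ms_type = async+rdma
ms_async_rdma_device_name = mlx4_0
ms_async_rdma_port_num = 2
"""

RDMA_KEYS = ("ms_type", "ms_async_rdma_device_name", "ms_async_rdma_port_num")
NETWORK_KEYS = ("public_network", "cluster_network")


def check_rdma_conf(text, section="global"):
    """Return a list of problems found; an empty list means none were found."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section(section):
        return ["missing section [%s]" % section]
    opts = parser[section]
    problems = []
    # Every RDMA-related option from the post must be present.
    for key in RDMA_KEYS:
        if key not in opts:
            problems.append("missing option: " + key)
    if "ms_type" in opts and opts["ms_type"] != "async+rdma":
        problems.append("ms_type is %r, expected 'async+rdma'" % opts["ms_type"])
    # Networks must be valid CIDR ranges with no host bits set.
    for key in NETWORK_KEYS:
        if key in opts:
            try:
                ipaddress.ip_network(opts[key])
            except ValueError:
                problems.append("%s is not a valid CIDR network: %r" % (key, opts[key]))
    port = opts.get("ms_async_rdma_port_num")
    if port is not None and not port.isdigit():
        problems.append("ms_async_rdma_port_num is not a number: %r" % port)
    return problems


print(check_rdma_conf(CONF_TEXT))
```

For the settings in this post the helper reports no problems, so a syntax slip in these lines is unlikely to be the cause. It cannot tell whether mlx4_0 port 2 is the right device and port for the running daemons.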

 

The IB network uses 10.10.121.0/24 addresses, and the "ibdev2netdev" command shows that port 2 is up.
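
One quick cross-check is whether the monitor address falls inside that range. The Python sketch below tests the address 10.10.121.25:6789/0, which is the value reported in the mon_status output further down, against the configured public_network. The helper name in_public_network is hypothetical, and the parsing handles IPv4 only.

```python
import ipaddress

# Value of public_network from the ceph.conf lines in this post.
PUBLIC_NETWORK = ipaddress.ip_network("10.10.121.0/24")


def in_public_network(addr):
    """Check an address that may carry a ':port/nonce' suffix (IPv4 only)."""
    host = addr.split(":", 1)[0]
    return ipaddress.ip_address(host) in PUBLIC_NETWORK


print(in_public_network("10.10.121.25:6789/0"))
```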

The error occurs when running "ceph-deploy --overwrite-conf mon create-initial". The ceph-deploy log follows:

 

[2018-07-12 17:53:48,943][ceph_deploy.conf][DEBUG ] found configuration file at: /home/user1/.cephdeploy.conf

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ] Invoked (1.5.37): /usr/bin/ceph-deploy --overwrite-conf mon create-initial

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ] ceph-deploy options:

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ]  username                      : None

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ]  verbose                       : False

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ]  overwrite_conf                : True

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ]  subcommand                    : create-initial

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ]  quiet                         : False

[2018-07-12 17:53:48,945][ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf object at 0x27e6210>

[2018-07-12 17:53:48,945][ceph_deploy.cli][INFO  ]  cluster                       : ceph

[2018-07-12 17:53:48,945][ceph_deploy.cli][INFO  ]  func                          : <function mon at 0x2a7d2a8>

[2018-07-12 17:53:48,945][ceph_deploy.cli][INFO  ]  ceph_conf                     : None

[2018-07-12 17:53:48,945][ceph_deploy.cli][INFO  ]  default_release               : False

[2018-07-12 17:53:48,945][ceph_deploy.cli][INFO  ]  keyrings                      : None

[2018-07-12 17:53:48,947][ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts node1

[2018-07-12 17:53:48,947][ceph_deploy.mon][DEBUG ] detecting platform for host node1 ...

[2018-07-12 17:53:49,005][node1][DEBUG ] connection detected need for sudo

[2018-07-12 17:53:49,039][node1][DEBUG ] connected to host: node1

[2018-07-12 17:53:49,040][node1][DEBUG ] detect platform information from remote host

[2018-07-12 17:53:49,073][node1][DEBUG ] detect machine type

[2018-07-12 17:53:49,078][node1][DEBUG ] find the location of an executable

[2018-07-12 17:53:49,079][ceph_deploy.mon][INFO  ] distro info: CentOS Linux 7.2.1511 Core

[2018-07-12 17:53:49,079][node1][DEBUG ] determining if provided host has same hostname in remote

[2018-07-12 17:53:49,079][node1][DEBUG ] get remote short hostname

[2018-07-12 17:53:49,080][node1][DEBUG ] deploying mon to node1

[2018-07-12 17:53:49,080][node1][DEBUG ] get remote short hostname

[2018-07-12 17:53:49,081][node1][DEBUG ] remote hostname: node1

[2018-07-12 17:53:49,083][node1][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf

[2018-07-12 17:53:49,084][node1][DEBUG ] create the mon path if it does not exist

[2018-07-12 17:53:49,085][node1][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-node1/done

[2018-07-12 17:53:49,085][node1][DEBUG ] create a done file to avoid re-doing the mon deployment

[2018-07-12 17:53:49,086][node1][DEBUG ] create the init path if it does not exist

[2018-07-12 17:53:49,089][node1][INFO  ] Running command: sudo systemctl enable ceph.target

[2018-07-12 17:53:49,365][node1][INFO  ] Running command: sudo systemctl enable ceph-mon@node1

[2018-07-12 17:53:49,588][node1][INFO  ] Running command: sudo systemctl start ceph-mon@node1

[2018-07-12 17:53:51,762][node1][INFO  ] Running command: sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.node1.asok mon_status

[2018-07-12 17:53:51,979][node1][DEBUG ] ********************************************************************************

[2018-07-12 17:53:51,979][node1][DEBUG ] status for monitor: mon.node1

[2018-07-12 17:53:51,980][node1][DEBUG ] {

[2018-07-12 17:53:51,980][node1][DEBUG ]   "election_epoch": 3,

[2018-07-12 17:53:51,980][node1][DEBUG ]   "extra_probe_peers": [],

[2018-07-12 17:53:51,980][node1][DEBUG ]   "feature_map": {

[2018-07-12 17:53:51,981][node1][DEBUG ]     "mon": {

[2018-07-12 17:53:51,981][node1][DEBUG ]       "group": {

[2018-07-12 17:53:51,981][node1][DEBUG ]         "features": "0x1ffddff8eea4fffb",

[2018-07-12 17:53:51,981][node1][DEBUG ]         "num": 1,

[2018-07-12 17:53:51,981][node1][DEBUG ]         "release": "luminous"

[2018-07-12 17:53:51,981][node1][DEBUG ]       }

[2018-07-12 17:53:51,981][node1][DEBUG ]     }

[2018-07-12 17:53:51,982][node1][DEBUG ]   },

[2018-07-12 17:53:51,982][node1][DEBUG ]   "features": {

[2018-07-12 17:53:51,982][node1][DEBUG ]     "quorum_con": "2305244844532236283",

[2018-07-12 17:53:51,982][node1][DEBUG ]     "quorum_mon": [

[2018-07-12 17:53:51,982][node1][DEBUG ]       "kraken",

[2018-07-12 17:53:51,982][node1][DEBUG ]       "luminous"

[2018-07-12 17:53:51,982][node1][DEBUG ]     ],

[2018-07-12 17:53:51,982][node1][DEBUG ]     "required_con": "153140804152475648",

[2018-07-12 17:53:51,983][node1][DEBUG ]     "required_mon": [

[2018-07-12 17:53:51,983][node1][DEBUG ]       "kraken",

[2018-07-12 17:53:51,983][node1][DEBUG ]       "luminous"

[2018-07-12 17:53:51,983][node1][DEBUG ]     ]

[2018-07-12 17:53:51,983][node1][DEBUG ]   },

[2018-07-12 17:53:51,983][node1][DEBUG ]   "monmap": {

[2018-07-12 17:53:51,983][node1][DEBUG ]     "created": "2018-07-12 17:41:24.243749",

[2018-07-12 17:53:51,984][node1][DEBUG ]     "epoch": 1,

[2018-07-12 17:53:51,984][node1][DEBUG ]     "features": {

[2018-07-12 17:53:51,984][node1][DEBUG ]       "optional": [],

[2018-07-12 17:53:51,984][node1][DEBUG ]       "persistent": [

[2018-07-12 17:53:51,984][node1][DEBUG ]         "kraken",

[2018-07-12 17:53:51,984][node1][DEBUG ]         "luminous"

[2018-07-12 17:53:51,984][node1][DEBUG ]       ]

[2018-07-12 17:53:51,984][node1][DEBUG ]     },

[2018-07-12 17:53:51,985][node1][DEBUG ]     "fsid": "9317bc6a-ea20-4376-a390-52afa0b81353",

[2018-07-12 17:53:51,985][node1][DEBUG ]     "modified": "2018-07-12 17:41:24.243749",

[2018-07-12 17:53:51,985][node1][DEBUG ]     "mons": [

[2018-07-12 17:53:51,985][node1][DEBUG ]       {

[2018-07-12 17:53:51,985][node1][DEBUG ]         "addr": "10.10.121.25:6789/0",

[2018-07-12 17:53:51,985][node1][DEBUG ]         "name": "node1",

[2018-07-12 17:53:51,985][node1][DEBUG ]         "public_addr": "10.10.121.25:6789/0",

[2018-07-12 17:53:51,986][node1][DEBUG ]         "rank": 0

[2018-07-12 17:53:51,986][node1][DEBUG ]       }

[2018-07-12 17:53:51,986][node1][DEBUG ]     ]

[2018-07-12 17:53:51,986][node1][DEBUG ]   },

[2018-07-12 17:53:51,986][node1][DEBUG ]   "name": "node1",

[2018-07-12 17:53:51,986][node1][DEBUG ]   "outside_quorum": [],

[2018-07-12 17:53:51,986][node1][DEBUG ]   "quorum": [

[2018-07-12 17:53:51,986][node1][DEBUG ]     0

[2018-07-12 17:53:51,987][node1][DEBUG ]   ],

[2018-07-12 17:53:51,987][node1][DEBUG ]   "rank": 0,

[2018-07-12 17:53:51,987][node1][DEBUG ]   "state": "leader",

[2018-07-12 17:53:51,987][node1][DEBUG ]   "sync_provider": []

[2018-07-12 17:53:51,987][node1][DEBUG ] }

[2018-07-12 17:53:51,987][node1][DEBUG ] ********************************************************************************

[2018-07-12 17:53:51,987][node1][INFO  ] monitor: mon.node1 is running

[2018-07-12 17:53:51,989][node1][INFO  ] Running command: sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.node1.asok mon_status

[2018-07-12 17:53:52,156][ceph_deploy.mon][INFO  ] processing monitor mon.node1

[2018-07-12 17:53:52,194][node1][DEBUG ] connection detected need for sudo

[2018-07-12 17:53:52,230][node1][DEBUG ] connected to host: node1

[2018-07-12 17:53:52,231][node1][DEBUG ] detect platform information from remote host

[2018-07-12 17:53:52,265][node1][DEBUG ] detect machine type

[2018-07-12 17:53:52,270][node1][DEBUG ] find the location of an executable

[2018-07-12 17:53:52,273][node1][INFO  ] Running command: sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.node1.asok mon_status

[2018-07-12 17:53:52,439][ceph_deploy.mon][INFO  ] mon.node1 monitor has reached quorum!

[2018-07-12 17:53:52,440][ceph_deploy.mon][INFO  ] all initial monitors are running and have formed quorum

[2018-07-12 17:53:52,440][ceph_deploy.mon][INFO  ] Running gatherkeys...

[2018-07-12 17:53:52,441][ceph_deploy.gatherkeys][INFO  ] Storing keys in temp directory /tmp/tmp8bdYT6

[2018-07-12 17:53:52,477][node1][DEBUG ] connection detected need for sudo

[2018-07-12 17:53:52,510][node1][DEBUG ] connected to host: node1

[2018-07-12 17:53:52,511][node1][DEBUG ] detect platform information from remote host

[2018-07-12 17:53:52,552][node1][DEBUG ] detect machine type

[2018-07-12 17:53:52,558][node1][DEBUG ] get remote short hostname

[2018-07-12 17:53:52,559][node1][DEBUG ] fetch remote file

[2018-07-12 17:53:52,562][node1][INFO  ] Running command: sudo /usr/bin/ceph --connect-timeout=25 --cluster=ceph --admin-daemon=/var/run/ceph/ceph-mon.node1.asok mon_status

[2018-07-12 17:53:52,731][node1][INFO  ] Running command: sudo /usr/bin/ceph --connect-timeout=25 --cluster=ceph --name mon. --keyring=/var/lib/ceph/mon/ceph-node1/keyring auth get client.admin

[2018-07-12 17:54:18,059][node1][ERROR ] "ceph auth get-or-create for keytype admin returned 1

[2018-07-12 17:54:18,059][node1][DEBUG ] Cluster connection interrupted or timed out

[2018-07-12 17:54:18,059][node1][ERROR ] Failed to return 'admin' key from host node1

[2018-07-12 17:54:18,059][ceph_deploy.gatherkeys][ERROR ] Failed to connect to host:node1

[2018-07-12 17:54:18,060][ceph_deploy.gatherkeys][INFO  ] Destroy temp directory /tmp/tmp8bdYT6

[2018-07-12 17:54:18,060][ceph_deploy][ERROR ] RuntimeError: Failed to connect any mon
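
The timestamps in the log above show how long the failing command ran before the first error. The Python sketch below parses ceph-deploy log lines and measures that gap. The regular expression is inferred from the lines in this post only, and the helper names parse_line and seconds_until_first_error are hypothetical.

```python
import re
from datetime import datetime

# Line layout inferred from this post: [timestamp][source][LEVEL ] message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<src>[^\]]+)\]\[(?P<level>[A-Z]+)\s*\]\s?(?P<msg>.*)$"
)

SAMPLE = '''\
[2018-07-12 17:53:52,731][node1][INFO  ] Running command: sudo /usr/bin/ceph --connect-timeout=25 --cluster=ceph --name mon. --keyring=/var/lib/ceph/mon/ceph-node1/keyring auth get client.admin
[2018-07-12 17:54:18,059][node1][ERROR ] "ceph auth get-or-create for keytype admin returned 1
[2018-07-12 17:54:18,060][ceph_deploy][ERROR ] RuntimeError: Failed to connect any mon
'''


def parse_line(line):
    """Split one log line into (timestamp, source, level, message), or None."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
    return ts, m.group("src"), m.group("level"), m.group("msg")


def seconds_until_first_error(text, command_marker="auth get client.admin"):
    """Seconds between the marked command and the first ERROR line after it."""
    started = None
    for line in text.splitlines():
        rec = parse_line(line)
        if rec is None:
            continue
        ts, _src, level, msg = rec
        if started is None:
            if command_marker in msg:
                started = ts
        elif level == "ERROR":
            return (ts - started).total_seconds()
    return None


print(seconds_until_first_error(SAMPLE))
```

On the lines from this post the gap is a little over 25 seconds, which lines up with the --connect-timeout=25 option in the logged command. That suggests the client waited out its timeout rather than being refused outright, though the log alone does not prove it.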

 

The ceph-mon service is up but cannot be reached; "ceph -s" returns the same kind of error:

 

2018-07-13 10:44:21.169536 7fa570d4e700  0 monclient(hunting): authenticate timed out after 300

2018-07-13 10:44:21.169579 7fa570d4e700  0 librados: client.admin authentication error (110) Connection timed out

[errno 110] error connecting to the cluster


I am running Ceph version 12.2.4 (luminous, stable). Does anyone have any suggestions about this issue?

 

Thx


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
