Hi.
I rebuilt my Ceph yesterday with the latest master branch, and the problem still occurs.
I also found that the number of receive errors increases during the test (/sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_errors),
and I think that is why the OSDs' connections break. I will try to figure it out.
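
In case it helps, below is a rough sketch of how I sample that counter while fio runs (it assumes the same mlx4_0 port-1 counter path as above; the 5-second interval is just a placeholder):

#!/usr/bin/env python
# Rough sketch: poll the RDMA receive-error counter while the fio test runs.
# Uses the counter path mentioned above; the 5-second interval is arbitrary.
import time

COUNTER = "/sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_errors"

def read_counter(path):
    with open(path) as f:
        return int(f.read().strip())

last = read_counter(COUNTER)
while True:
    time.sleep(5)
    current = read_counter(COUNTER)
    if current != last:
        print("%s port_rcv_errors=%d (+%d)" % (time.strftime("%H:%M:%S"),
                                               current, current - last))
    last = current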
Thanks.
Best Regards,
Hung-Wei Chiu(邱宏瑋)
--
Computer Center, Department of Computer Science
National Chiao Tung University
2017-03-21 5:07 GMT+08:00 Haomai Wang <haomai@xxxxxxxx>:
plz use master branch to test rdma

On Sun, Mar 19, 2017 at 11:08 PM, Hung-Wei Chiu (邱宏瑋) <hwchiu@xxxxxxxxxxxxxx> wrote:

Hi,

I want to test the performance of Ceph with RDMA, so I built Ceph with RDMA support and deployed it into my test environment manually.
I use fio for my performance evaluation, and it works fine when Ceph uses async + posix as its ms_type.
After changing the ms_type from async + posix to async + rdma, some OSDs go down during the performance test, so fio cannot finish its job. The log file of those strange OSDs shows that something goes wrong when the OSD tries to send a message, as you can see below:

...
2017-03-20 09:43:10.096042 7faac163e700 -1 Infiniband recv_msg got error -104: (104) Connection reset by peer
2017-03-20 09:43:10.096314 7faac163e700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.17:6813/32315 conn(0x563de5282000 :-1 s=STATE_OPEN pgs=264 cs=29 l=0).fault initiating reconnect
2017-03-20 09:43:10.251606 7faac1e3f700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
2017-03-20 09:43:10.251755 7faac1e3f700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.17:6821/32509 conn(0x563de51f1000 :-1 s=STATE_OPEN pgs=314 cs=24 l=0).fault initiating reconnect
2017-03-20 09:43:10.254103 7faac1e3f700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
2017-03-20 09:43:10.254375 7faac1e3f700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.15:6821/48196 conn(0x563de514b000 :6809 s=STATE_OPEN pgs=275 cs=30 l=0).fault initiating reconnect
2017-03-20 09:43:10.260622 7faac1e3f700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
2017-03-20 09:43:10.260693 7faac1e3f700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.15:6805/47835 conn(0x563de537d800 :-1 s=STATE_OPEN pgs=310 cs=11 l=0).fault with nothing to send, going to standby
2017-03-20 09:43:10.264621 7faac163e700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
2017-03-20 09:43:10.264682 7faac163e700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.15:6829/48397 conn(0x563de5fdb000 :-1 s=STATE_OPEN pgs=231 cs=23 l=0).fault with nothing to send, going to standby
2017-03-20 09:43:10.291832 7faac163e700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
2017-03-20 09:43:10.291895 7faac163e700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.17:6817/32412 conn(0x563de50f5800 :-1 s=STATE_OPEN pgs=245 cs=25 l=0).fault initiating reconnect
2017-03-20 09:43:10.387540 7faac2e41700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
2017-03-20 09:43:10.387565 7faac2e41700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
2017-03-20 09:43:10.387635 7faac2e41700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.17:6801/32098 conn(0x563de51ab800 :6809 s=STATE_OPEN pgs=268 cs=23 l=0).fault with nothing to send, going to standby
2017-03-20 09:43:11.453373 7faabdee0700 -1 osd.10 902 heartbeat_check: no reply from 10.0.0.15:6803 osd.0 since back 2017-03-20 09:42:50.610507 front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
2017-03-20 09:43:11.453422 7faabdee0700 -1 osd.10 902 heartbeat_check: no reply from 10.0.0.15:6807 osd.1 since back 2017-03-20 09:42:50.610507 front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
2017-03-20 09:43:11.453435 7faabdee0700 -1 osd.10 902 heartbeat_check: no reply from 10.0.0.15:6811 osd.2 since back 2017-03-20 09:42:50.610507 front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
2017-03-20 09:43:11.453444 7faabdee0700 -1 osd.10 902 heartbeat_check: no reply from 10.0.0.15:6815 osd.3 since back 2017-03-20 09:42:50.610507 front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
...

The following is my environment.

[Software]
Ceph Version: ceph version 12.0.0-1356-g7ba32cb (built by myself from the master branch)
Deployment: Without ceph-deploy and systemd, just manually invoking every daemon.
Host: Ubuntu 16.04.1 LTS (x86_64), with Linux kernel 4.4.0-66-generic.
NIC: Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
NIC Driver: MLNX_OFED_LINUX-4.0-1.0.1.0 (OFED-4.0-1.0.1)

[Configuration]

Ceph.conf:
[global]
fsid = 0612cc7e-6239-456c-978b-b4df781fe831
mon initial members = ceph-1,ceph-2,ceph-3
mon host = 10.0.0.15,10.0.0.16,10.0.0.17
osd pool default size = 2
osd pool default pg num = 1024
osd pool default pgp num = 1024
ms_type = async+rdma
ms_async_rdma_device_name = mlx4_0

Fio.conf:
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=rbd
clustername=ceph
runtime=120
iodepth=128
numjobs=6
group_reporting
size=256G
direct=1
ramp_time=5

[r75w25]
bs=4k
rw=randrw
rwmixread=75

[Cluster Env]
- Three nodes in total.
- 3 Ceph monitors on each node.
- 8 Ceph OSDs on each node (24 OSDs in total).
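
For reference, this is roughly how I drive the test end to end (a sketch only: the job file name "fio.conf" and the 10-second poll interval are placeholders, and "ceph osd stat" is just used to watch whether OSDs drop out while fio runs):

#!/usr/bin/env python
# Sketch only: run the fio job defined above and watch whether any OSDs
# drop out while it runs. The job file name "fio.conf" and the 10-second
# poll interval are placeholders, not part of the real setup.
import subprocess
import time

fio = subprocess.Popen(["fio", "fio.conf"])
while fio.poll() is None:
    # "ceph osd stat" prints a one-line summary such as "24 osds: 24 up, 24 in"
    status = subprocess.check_output(["ceph", "osd", "stat"])
    print(status.decode().strip())
    time.sleep(10)
print("fio exited with code %d" % fio.returncode)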
Thanks

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com