On 07:31 Mon 28 Oct, Mason-Williams, Gabryel (DLSLtd,RAL,LSCI) wrote:
> I am using ceph version 12.2.8
> (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable).
>
> I have not checked the master branch. Do you think this is an issue in
> luminous that has been fixed in later versions?

I haven't hit this problem on the master branch. Ceph/RDMA changed a lot
between luminous and master. Is the configuration below really needed in
ceph.conf on luminous?

> ms_async_rdma_local_gid = xxxx

On the master branch, this parameter is not needed at all.

B.R. Changcheng

> ______________________________________________________________________
>
> From: Liu, Changcheng <changcheng.liu@xxxxxxxxx>
> Sent: 25 October 2019 18:04
> To: Mason-Williams, Gabryel (DLSLtd,RAL,LSCI)
> <gabryel.mason-williams@xxxxxxxxxxxxx>
> Cc: ceph-users@xxxxxxxx <ceph-users@xxxxxxxx>; dev@xxxxxxx <dev@xxxxxxx>
> Subject: Re: RDMA Bug?
>
> What's your ceph version? Have you verified whether the problem can be
> reproduced on the master branch?
>
> On 08:33 Fri 25 Oct, Mason-Williams, Gabryel (DLSLtd,RAL,LSCI) wrote:
> > I am currently trying to run Ceph on RDMA (either RoCE v1 or v2), but
> > I am running into problems.
> >
> > When using Ceph on RDMA, OSDs randomly become unreachable even when
> > the cluster is left alone. It also does not appear to be talking over
> > RDMA properly: it uses Ethernet even though the config states it
> > should use RDMA, as shown by near-identical benchmark results for the
> > two setups.
> >
> > After reloading the cluster:
> >
> > [inline screenshot: cluster status, healthy]
> >
> > After 5m 9s the cluster went from healthy to down:
> >
> > [inline screenshot: cluster status, down]
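A quick way to confirm whether traffic is really falling back to
Ethernet is to watch the HCA's port counters while the cluster is busy,
independent of Ceph's own perf counters. A minimal sketch, assuming
mlx4_0 port 1 and a pool named testpool (device, port, and pool names
are placeholders for your setup):

    # Snapshot the RDMA port byte counters, drive some cluster traffic,
    # then diff. port_xmit_data/port_rcv_data count 4-byte words, so
    # multiply by 4 to get bytes.
    C=/sys/class/infiniband/mlx4_0/ports/1/counters
    tx0=$(cat "$C/port_xmit_data"); rx0=$(cat "$C/port_rcv_data")
    rados bench -p testpool 30 write
    tx1=$(cat "$C/port_xmit_data"); rx1=$(cat "$C/port_rcv_data")
    echo "rdma tx bytes: $(( (tx1 - tx0) * 4 ))"
    echo "rdma rx bytes: $(( (rx1 - rx0) * 4 ))"

If both deltas stay near zero during the benchmark, the async+rdma
stack is not actually carrying the cluster traffic.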
> > The problem also occurs when running a benchmark against the
> > cluster: OSDs simply fall over. Again, it does not seem to be
> > properly talking over RDMA and is instead using Ethernet.
> >
> > [inline screenshot: benchmark results]
> >
> > Next test:
> >
> > [inline screenshot: benchmark results]
> >
> > The config used for RDMA is as follows:
> >
> > [global]
> > fsid = aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa
> > mon_initial_members = node1, node2, node3
> > mon_host = xxx.xxx.xxx.xxx, xxx.xxx.xxx.xxx, xxx.xxx.xxx.xxx
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > public_network = xxx.xxx.xxx.xxx/24
> > cluster_network = yyy.yyy.yyy.yyy/16
> > ms_cluster_type = async+rdma
> > ms_public_type = async+posix
> > ms_async_rdma_device_name = mlx4_0
> >
> > [osd.0]
> > ms_async_rdma_local_gid = xxxx
> >
> > [osd.1]
> > ms_async_rdma_local_gid = xxxx
> >
> > [osd.2]
> > ms_async_rdma_local_gid = xxxx
> >
> > Tests to check the system is using RDMA:
> >
> > sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep ms_cluster
> >
> > OUTPUT
> >
> > "ms_cluster_type": "async+rdma",
> >
> > sudo ceph daemon osd.0 perf dump AsyncMessenger::RDMAWorker-1
> >
> > OUTPUT
> >
> > {
> >     "AsyncMessenger::RDMAWorker-1": {
> >         "tx_no_mem": 0,
> >         "tx_parital_mem": 0,
> >         "tx_failed_post": 0,
> >         "rx_no_registered_mem": 0,
> >         "tx_chunks": 9,
> >         "tx_bytes": 2529,
> >         "rx_chunks": 0,
> >         "rx_bytes": 0,
> >         "pending_sent_conns": 0
> >     }
> > }
> >
> > When running over Ethernet I have a completely stable system, with
> > the current benchmarks as follows:
> >
> > [inline screenshot: benchmark results over Ethernet]
> >
> > The config setup when using Ethernet is:
> >
> > [global]
> > fsid = aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa
> > mon_initial_members = node1, node2, node3
> > mon_host = xxx.xxx.xxx.xxx, xxx.xxx.xxx.xxx, xxx.xxx.xxx.xxx
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > public_network = xxx.xxx.xxx.xxx/24
> > cluster_network = yyy.yyy.yyy.yyy/16
> > ms_cluster_type = async+posix
> > ms_public_type = async+posix
> > ms_async_rdma_device_name = mlx4_0
> >
> > [osd.0]
> > ms_async_rdma_local_gid = xxxx
> >
> > [osd.1]
> > ms_async_rdma_local_gid = xxxx
> >
> > [osd.2]
> > ms_async_rdma_local_gid = xxxx
> >
> > Tests to check the system is using async+posix:
> >
> > sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep ms_cluster
> >
> > OUTPUT
> >
> > "ms_cluster_type": "async+posix"
> >
> > sudo ceph daemon osd.0 perf dump AsyncMessenger::RDMAWorker-1
> >
> > OUTPUT
> >
> > {}
> >
> > This is clearly an issue with RDMA and not with the OSDs, shown by
> > the fact that the system is completely fine over Ethernet and not
> > over RDMA.
> >
> > Any guidance or ideas on how to approach this problem to make Ceph
> > work with RDMA would be greatly appreciated.
> >
> > Regards
> >
> > Gabryel Mason-Williams, Placement Student
> >
> > Address: Diamond Light Source Ltd., Diamond House, Harwell Science &
> > Innovation Campus, Didcot, Oxfordshire OX11 0DE
> >
> > Email: gabryel.mason-williams@xxxxxxxxxxxxx
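Two things may be worth checking here. First, since luminous requires
ms_async_rdma_local_gid per daemon, confirm that each configured GID
exactly matches an entry the HCA actually exposes, and that the entry's
RoCE type (v1 vs v2) is the one you intend. A minimal sketch, assuming
mlx4_0 port 1 as in your config (the port number is an assumption):

    # List populated GID table entries and their RoCE type. The
    # configured ms_async_rdma_local_gid must match one of these
    # exactly; a v1 GID will not interoperate with a v2 peer.
    P=/sys/class/infiniband/mlx4_0/ports/1
    for g in "$P"/gids/*; do
        i=${g##*/}
        gid=$(cat "$g" 2>/dev/null) || continue
        [ "$gid" = "0000:0000:0000:0000:0000:0000:0000:0000" ] && continue
        echo "index $i  $gid  $(cat "$P/gid_attrs/types/$i" 2>/dev/null)"
    done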
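Second, it can help to rule out the fabric itself by testing RDMA
between two OSD hosts with nothing Ceph-specific involved. rping (from
librdmacm-utils) and ib_send_bw (from perftest) are enough for a first
pass; the address below is a placeholder for the first host's
cluster-network IP:

    # On the first OSD host (responder side), run:
    #     rping -s -v
    #     ib_send_bw -d mlx4_0
    #
    # Then, from a second OSD host (initiator side), target the first
    # host's cluster-network address:
    rping -c -a yyy.yyy.yyy.1 -v -C 10    # 10 RDMA ping-pong iterations
    ib_send_bw -d mlx4_0 yyy.yyy.yyy.1    # bandwidth test over the HCA
    # For RoCE you may need to add -x <gid index> to ib_send_bw on both
    # sides to select the right GID table entry.

If these fail or show no traffic, the problem lies in the RoCE setup
(GID selection, MTU, flow control) rather than in Ceph's messenger.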
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx