Re: RMDA Bug?

On 07:31 Mon 28 Oct, Mason-Williams, Gabryel (DLSLtd,RAL,LSCI) wrote:
>    I am using ceph version 12.2.8
>    (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable).
> 
>    I have not checked the master branch. Do you think this is an issue in
>    luminous that has been fixed in later versions?
     I haven't hit this problem on the master branch. Ceph/RDMA changed a
     lot between luminous and master.

     Is the configuration below really needed in your luminous ceph.conf?
         >    ms_async_rdma_local_gid = xxxx
     On the master branch, this parameter is not needed at all.
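
     If the GID really is needed there, note that each [osd.N] value must
     match a GID that actually exists on that host's device. A quick way to
     check (a sketch, assuming standard sysfs paths, device mlx4_0, port 1;
     adjust to your setup):

         # list the GIDs the HCA exposes on port 1
         for g in /sys/class/infiniband/mlx4_0/ports/1/gids/*; do
             echo "$g: $(cat "$g" 2>/dev/null)"
         done
         # RoCE version behind a given GID index (here index 0)
         cat /sys/class/infiniband/mlx4_0/ports/1/gid_attrs/types/0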
B.R.
Changcheng
>      __________________________________________________________________
> 
>    From: Liu, Changcheng <changcheng.liu@xxxxxxxxx>
>    Sent: 25 October 2019 18:04
>    To: Mason-Williams, Gabryel (DLSLtd,RAL,LSCI)
>    <gabryel.mason-williams@xxxxxxxxxxxxx>
>    Cc: ceph-users@xxxxxxxx <ceph-users@xxxxxxxx>; dev@xxxxxxx
>    <dev@xxxxxxx>
>    Subject: Re: RMDA Bug?
> 
>    What's your Ceph version? Have you verified whether the problem can be
>    reproduced on the master branch?
>    On 08:33 Fri 25 Oct, Mason-Williams, Gabryel (DLSLtd,RAL,LSCI) wrote:
>    >    I am currently trying to run Ceph on RDMA, either RoCE v1 or v2.
>    >    However, I am experiencing issues with this.
>    >
>    >    When using Ceph on RDMA, OSDs will randomly become unreachable even
>    >    if the cluster is left alone. It is also not properly talking over
>    >    RDMA and is using Ethernet instead when the config states it should,
>    >    as shown by the two setups giving the same benchmark results.
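
     When an OSD drops out like this, its own log usually says why. A first
     check worth doing on the OSD's host (a sketch, assuming a systemd
     deployment and osd.0 as the failed daemon):

         # which OSDs does the cluster currently see as down?
         ceph osd tree | grep -w down
         # then inspect that OSD's recent log for heartbeat/RDMA errors
         journalctl -u ceph-osd@0 --since "10 minutes ago" \
             | grep -iE 'heartbeat|rdma|fault'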
>    >
>    >    After reloading the cluster
>    >    [inline screenshot omitted]
>    >
>    >    After 5m 9s, the cluster went from healthy to down.
>    >
>    >    [inline screenshot omitted]
>    >
>    >    This problem even happens when running a benchmark test on the
>    >    cluster: OSDs will just fall over. Another curious issue is that it
>    >    is not properly talking over RDMA and is instead using Ethernet.
>    >
>    >    [inline screenshot omitted]
>    >
>    >    Next test:
>    >
>    >    [inline screenshot omitted]
>    >
>    >    The config used for RDMA is as follows:
>    >
>    >    [global]
>    >    fsid = aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa
>    >    mon_initial_members = node1, node2, node3
>    >    mon_host = xxx.xxx.xxx.xxx, xxx.xxx.xxx.xxx, xxx.xxx.xxx.xxx
>    >    auth_cluster_required = cephx
>    >    auth_service_required = cephx
>    >    auth_client_required = cephx
>    >    public_network = xxx.xxx.xxx.xxx/24
>    >    cluster_network = yyy.yyy.yyy.yyy/16
>    >    ms_cluster_type = async+rdma
>    >    ms_public_type = async+posix
>    >    ms_async_rdma_device_name = mlx4_0
>    >
>    >    [osd.0]
>    >    ms_async_rdma_local_gid = xxxx
>    >
>    >    [osd.1]
>    >    ms_async_rdma_local_gid = xxxx
>    >
>    >    [osd.2]
>    >    ms_async_rdma_local_gid = xxxx
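
     Before digging further into Ceph itself, it may be worth confirming
     that plain RDMA traffic flows between the OSD hosts at all, e.g. with
     rping from librdmacm-utils (a sketch; <node1-cluster-ip> is a
     placeholder for node1's cluster-network address):

         # on node1: run an RDMA ping-pong server
         rping -s -v
         # on node2: connect over the cluster network, 10 iterations
         rping -c -a <node1-cluster-ip> -v -C 10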
>    >
>    >    Tests to check that the system is using RDMA:
>    >
>    >    sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep ms_cluster
>    >
>    >    OUTPUT
>    >
>    >    "ms_cluster_type": "async+rdma",
>    >
>    >    sudo ceph daemon osd.0 perf dump AsyncMessenger::RDMAWorker-1
>    >
>    >    OUTPUT
>    >
>    >    {
>    >        "AsyncMessenger::RDMAWorker-1": {
>    >            "tx_no_mem": 0,
>    >            "tx_parital_mem": 0,
>    >            "tx_failed_post": 0,
>    >            "rx_no_registered_mem": 0,
>    >            "tx_chunks": 9,
>    >            "tx_bytes": 2529,
>    >            "rx_chunks": 0,
>    >            "rx_bytes": 0,
>    >            "pending_sent_conns": 0
>    >        }
>    >    }
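
     Note that rx_chunks and rx_bytes are both 0 here, which matches the
     suspicion that no RDMA traffic is actually arriving. One way to confirm
     is to sample these counters while a benchmark runs (on the OSD's host):

         # sample the RDMA counters three times, 10 seconds apart
         for i in 1 2 3; do
             sudo ceph daemon osd.0 perf dump AsyncMessenger::RDMAWorker-1 \
                 | grep -E '"(tx|rx)_(bytes|chunks)"'
             sleep 10
         done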
>    >
>    >    When running over Ethernet I have a completely stable system, with
>    >    the current benchmarks as follows:
>    >
>    >    [inline screenshot omitted]
>    >
>    >    The config setup when using Ethernet is:
>    >
>    >    [global]
>    >    fsid = aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa
>    >    mon_initial_members = node1, node2, node3
>    >    mon_host = xxx.xxx.xxx.xxx, xxx.xxx.xxx.xxx, xxx.xxx.xxx.xxx
>    >    auth_cluster_required = cephx
>    >    auth_service_required = cephx
>    >    auth_client_required = cephx
>    >    public_network = xxx.xxx.xxx.xxx/24
>    >    cluster_network = yyy.yyy.yyy.yyy/16
>    >    ms_cluster_type = async+posix
>    >    ms_public_type = async+posix
>    >    ms_async_rdma_device_name = mlx4_0
>    >
>    >    [osd.0]
>    >    ms_async_rdma_local_gid = xxxx
>    >
>    >    [osd.1]
>    >    ms_async_rdma_local_gid = xxxx
>    >
>    >    [osd.2]
>    >    ms_async_rdma_local_gid = xxxx
>    >
>    >    Tests to check that the system is using async+posix:
>    >
>    >    sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep ms_cluster
>    >
>    >    OUTPUT
>    >
>    >    "ms_cluster_type": "async+posix"
>    >
>    >    sudo ceph daemon osd.0 perf dump AsyncMessenger::RDMAWorker-1
>    >
>    >    OUTPUT
>    >
>    >    {}
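
     The empty dump just means no RDMAWorker was instantiated, which is
     expected with async+posix. To see which messenger type a running daemon
     actually loaded, the admin socket can also be queried directly, e.g.:

         # ask the running daemon which cluster messenger it is using
         sudo ceph daemon osd.0 config get ms_cluster_type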
>    >
>    >    This is clearly an issue with RDMA and not with the OSDs, as shown
>    >    by the fact that the system is completely stable over Ethernet but
>    >    not over RDMA.
>    >
>    >    Any guidance or ideas on how to approach this problem to make Ceph
>    >    work with RDMA would be greatly appreciated.
>    >
>    >    Regards
>    >
>    >    Gabryel Mason-Williams, Placement Student
>    >
>    >    Address: Diamond Light Source Ltd., Diamond House, Harwell Science
>    >    & Innovation Campus, Didcot, Oxfordshire OX11 0DE
>    >
>    >    Email: gabryel.mason-williams@xxxxxxxxxxxxx
>    >
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx



