Re: RDMA Bug?

@Williams,
   Sorry for the late reply. I've been busy gathering Ceph/RDMA
   performance data these days.

   I'm using an Intel RDMA NIC with a small cluster, and no serious
   issues have come up.
   For a Mellanox NIC, your ceph.conf looks fine from my perspective.
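
   If you're unsure which device name to put in ms_async_rdma_device_name,
   a quick check is to list the verbs devices (a sketch, assuming the
   rdma-core/libibverbs utilities are installed; mlx5_0 and irdma1 are just
   the example names from our setups):

   [admin@server0 deploy]$ ibv_devices
   [admin@server0 deploy]$ ibv_devinfo -d mlx5_0    # or irdma1 on the Intel NIC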

   Below are the two nodes used in the cluster:
     1. server0: 172.16.1.4, /dev/nvme0n1, /dev/nvme1n1
     2. server1: 172.16.1.2, /dev/nvme0n1, /dev/nvme1n1
   
   Below are my deployment steps:
   [admin@server0 deploy]$ ceph-deploy new server0 --fsid 24280750-d4f7-4d4f-89e4-f95b8fab87ff
   [admin@server0 deploy]$ #change ceph.conf as below:
   [admin@server0 deploy]$ cat ceph.conf
   [global]
       cluster = ceph
       fsid = 24280750-d4f7-4d4f-89e4-f95b8fab87ff
       auth_cluster_required = cephx
       auth_service_required = cephx
       auth_client_required = cephx
   
       osd pool default size = 2
       osd pool default min size = 2
       osd pool default pg num = 64
       osd pool default pgp num = 128
   
       osd pool default crush rule = 0
       osd crush chooseleaf type = 1
   
       mon_allow_pool_delete=true
       osd_pool_default_pg_autoscale_mode=on
   
       ms_type = async+rdma

       ;----changcheng: change device to your dev name----------
       ms_async_rdma_device_name = irdma1
        ;----changcheng: ignore the parameter below with a Mellanox NIC--------
       ;ms_async_rdma_support_srq = false
   
       mon_initial_members = server0
       mon_host = 172.16.1.4
   
   [mon.rdmarhel0]
       host = server0
       mon addr = 172.16.1.4

   [admin@server0 deploy]$ ceph-deploy mon create-initial
   [admin@server0 deploy]$ ceph-deploy admin server0 server1
   [admin@server0 deploy]$ ceph-deploy mgr create server0
   [admin@server0 deploy]$ ceph-deploy osd create --data /dev/nvme0n1 server0
   [admin@server0 deploy]$ ceph-deploy osd create --data /dev/nvme1n1 server0
   [admin@server0 deploy]$ ceph-deploy osd create --data /dev/nvme0n1 server1
   [admin@server0 deploy]$ ceph-deploy osd create --data /dev/nvme1n1 server1
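
   After the OSDs are created, a quick sanity check of the RDMA path (a
   sketch; adjust the OSD id and process name to your cluster) is to
   confirm that the daemons run with unlimited locked memory and that the
   RDMA worker counters (tx_bytes/rx_bytes) grow under load:

   [admin@server0 deploy]$ grep "Max locked memory" /proc/$(pgrep -o ceph-osd)/limits
   [admin@server0 deploy]$ sudo ceph -s
   [admin@server0 deploy]$ sudo ceph daemon osd.0 perf dump AsyncMessenger::RDMAWorker-1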

B.R.
Changcheng 

On 08:27 Thu 31 Oct, Mason-Williams, Gabryel (DLSLtd,RAL,LSCI) wrote:
>     1. When a public and cluster network are not defined, the OSD and MGR
>        nodes do not get recognised:
> 
>      sudo ceph -s
> 
>      cluster:
> 
>        id:     820f1573-bc4a-4ee0-b702-80ba5ac13c25
> 
>        health: HEALTH_WARN
> 
>                3 osds down
> 
>                3 hosts (3 osds) down
> 
>                1 root (3 osds) down
> 
>                no active mgr
> 
>                too few PGs per OSD (21 < min 30)
> 
> 
>      services:
> 
>        mon: 3 daemons, quorum
>    cs04r-sc-com99-05,cs04r-sc-com99-07,cs04r-sc-com99-08 (age 5m)
> 
>        mgr: no daemons active (since 4m)
> 
>        osd: 3 osds: 0 up (since 9m), 3 in (since 9m)
> 
> 
>      data:
> 
>        pools:    1 pools, 64 pgs
> 
>        objects: 0 objects, 0 B
> 
>        usage:   3.0 GiB used, 114 GiB / 117 GiB avail
> 
>        pgs:       44 stale+active+clean
> 
>                     20 active+clean
> 
>    This appears to be an issue with ms_type being async+rdma, since the
>    daemons are running:
> 
>      sudo systemctl status ceph-osd.target
> 
>    ● ceph-osd.target - ceph target allowing to start/stop all
>    ceph-osd@.service instances at once
> 
>       Loaded: loaded (/usr/lib/systemd/system/ceph-osd.target; enabled;
>    vendor preset: enabled)
> 
>       Active: active since Thu 2019-10-31 08:13:42 GMT; 8min ago
> 
>    sudo systemctl status ceph-mgr.target
> 
>    ● ceph-mgr.target - ceph target allowing to start/stop all
>    ceph-mgr@.service instances at once
>       Loaded: loaded (/usr/lib/systemd/system/ceph-mgr.target; enabled;
>    vendor preset: enabled)
>       Active: active since Thu 2019-10-31 08:13:33 GMT; 11min ago
> 
>    With the config being
> 
>      [global]
> 
>    fsid = 820f1573-bc4a-4ee0-b702-80ba5ac13c25
> 
>    mon_initial_members = node1, node2, node3
> 
>    mon_host = xxx.xx.xxx.aa,xxx.xx.xxx.ac, xxx.xx.xxx.ad
> 
>    auth_cluster_required = cephx
> 
>    auth_service_required = cephx
> 
>    auth_client_required = cephx
> 
>    ms_type = async+rdma
> 
>    ms_async_rdma_device_name = mlx4_0
>      __________________________________________________________________
> 
>    From: Liu, Changcheng <changcheng.liu@xxxxxxxxx>
>    Sent: 31 October 2019 01:09
>    To: Mason-Williams, Gabryel (DLSLtd,RAL,LSCI)
>    <gabryel.mason-williams@xxxxxxxxxxxxx>
>    Cc: dev@xxxxxxx <dev@xxxxxxx>
>    Subject: Re: RDMA Bug?
> 
>    >   2) I'll confirm with my colleague whether the cluster network is
>    >      really used in 14.2.4. We also hit a similar problem these days
>    >      even when using the TCP async messenger.
>    [Changcheng]:
>    1) The problem should already be solved in 14.2.4. We hit it in 14.2.1.
>    2) I'll try to verify your problem when I have time (I'm working on
>    other things). There should be no problem when unifying the public and
>    cluster network on the RDMA device.
>    On 23:22 Wed 30 Oct, Liu, Changcheng wrote:
>    > I'm working on the master branch and deployed a two-node cluster.
>    > Data is transferred over RDMA.
>    >       [admin@server0 ~]$ sudo ceph daemon osd.0 perf dump AsyncMessenger::RDMAWorker-1
>    >       {
>    >           "AsyncMessenger::RDMAWorker-1": {
>    >               "tx_no_mem": 0,
>    >               "tx_parital_mem": 0,
>    >               "tx_failed_post": 0,
>    >               "tx_chunks": 26966,
>    >               "tx_bytes": 52789637,
>    >               "rx_chunks": 26916,
>    >               "rx_bytes": 52812278,
>    >               "pending_sent_conns": 0
>    >           }
>    >       }
>    >
>    > The only difference is that I don't differentiate the public and
>    > cluster network in my cluster.
>    > You can try to make both the public and cluster network use RDMA.
>    > Note:
>    >   1) If both public/cluster use RDMA, we can't put them on different
>    >      subnetworks. This is a feature limitation; I'm planning to solve
>    >      it in the future.
>    >   2) I'll confirm with my colleague whether the cluster network is
>    >      really used in 14.2.4. We also hit a similar problem these days
>    >      even when using the TCP async messenger.
>    >
>    > Below is my cluster's ceph configuration.
>    > I also attach the systemd patch used in my side.
>    >       [admin@server0 ~]$ cat /etc/ceph/ceph.conf
>    >       [global]
>    >           cluster = ceph
>    >           fsid = 24280750-d4f7-4d4f-89e4-f95b8fab87ff
>    >           auth_cluster_required = cephx
>    >           auth_service_required = cephx
>    >           auth_client_required = cephx
>    >
>    >           osd pool default size = 2
>    >           osd pool default min size = 2
>    >           osd pool default pg num = 64
>    >           osd pool default pgp num = 128
>    >
>    >           osd pool default crush rule = 0
>    >           osd crush chooseleaf type = 1
>    >
>    >           mon_allow_pool_delete=true
>    >           osd_pool_default_pg_autoscale_mode=off
>    >
>    >           ms_type = async+rdma
>    >           ms_async_rdma_device_name = mlx5_0
>    >
>    >           mon_initial_members = server0
>    >           mon_host = 172.16.1.4
>    >
>    >       [mon.rdmarhel0]
>    >           host = server0
>    >           mon addr = 172.16.1.4
>    >       [admin@server0 ~]$
>    >
>    > B.R.
>    > Changcheng
>    >
>    > On 13:07 Wed 30 Oct, Mason-Williams, Gabryel (DLSLtd,RAL,LSCI) wrote:
>    > >     1. The current problem is that it is still sending data over
>    > >        ethernet instead of IB.
>    > >     2. [global]
>    > >        fsid=xxxx
>    > >        mon_initial_members = node1, node2, node3
>    > >        mon_host = xxx.xx.xxx.ab,xxx.xx.xxx.ac, xxx.xx.xxx.ad
>    > >        auth_cluster_required = cephx
>    > >        auth_service_required = cephx
>    > >        auth_client_required = cephx
>    > >        public_network = xxx.xx.xxx.0/24
>    > >        cluster_network = xx.xxx.0.0/16
>    > >        ms_cluster_type = async+rdma
>    > >        ms_type = async+rdma
>    > >        ms_public_type = async+posix
>    > >        [mgr]
>    > >        ms_type = async+posix
>    > >     3. The ceph cluster is deployed using ceph-deploy. Once it is up,
>    > >        all of the daemons are stopped, the RDMA cluster config is
>    > >        distributed to the nodes, and then the daemons are started
>    > >        again. The ulimit is set to unlimited; LimitMEMLOCK=infinity is
>    > >        set on ceph-disk@.service, ceph-mds@.service, ceph-mon@.service,
>    > >        ceph-osd@.service and ceph-radosgw@.service, as well as
>    > >        PrivateDevices=no on ceph-mds@.service, ceph-mon@.service and
>    > >        ceph-radosgw@.service. The ethernet MTU is set to 1000.
>    > >
>    __________________________________________________________________
>    > >
>    > >    From: Liu, Changcheng <changcheng.liu@xxxxxxxxx>
>    > >    Sent: 30 October 2019 12:24
>    > >    To: Mason-Williams, Gabryel (DLSLtd,RAL,LSCI)
>    > >    <gabryel.mason-williams@xxxxxxxxxxxxx>
>    > >    Cc: dev@xxxxxxx <dev@xxxxxxx>
>    > >    Subject: Re: RDMA Bug?
>    > >
>    > >    1. What problem do you hit when using RDMA in 14.2.4? Do any logs
>    > >    show the error?
>    > >    2. What's your ceph.conf?
>    > >    3. How do you deploy the ceph cluster? RDMA needs to lock some
>    > >    memory, so some system configuration has to be changed to meet
>    > >    this requirement.
>    > >    On 11:21 Wed 30 Oct, Gabryel Mason-Williams wrote:
>    > >    > Liu, Changcheng wrote:
>    > >    > > On 07:31 Mon 28 Oct, Mason-Williams, Gabryel (DLSLtd,RAL,LSCI)
>    > >    > > wrote:
>    > >    > > >     I am using ceph version 12.2.8
>    > >    > > >     (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable).
>    > >    > > >
>    > >    > > >     I have not checked the master branch. Do you think this is
>    > >    > > >     an issue in luminous that has been removed in later versions?
>    > >    > > I haven't hit the problem on the master branch. Ceph/RDMA changed
>    > >    > > a lot from luminous to the master branch.
>    > >    > >
>    > >    > > Is the configuration below really needed in luminous/ceph.conf?
>    > >    > > >     ms_async_rdma_local_gid = xxxx
>    > >    > > On the master branch, this parameter is not needed at all.
>    > >    > > B.R.
>    > >    > > Changcheng
>    > >    > > >
>    > >
>    __________________________________________________________________
>    > >    >
>    > >    > Thanks, the issue of the OSDs falling over seems to have gone away
>    > >    > after updating to Nautilus 14.2.4. However, I am still unable to get
>    > >    > it to communicate properly over RDMA, even after removing
>    > >    > ms_async_rdma_local_gid.
>    > >
>    > >
>    >
>    >
>    > From 40fa0d7096364b410e8242c46967029fb949876a Mon Sep 17 00:00:00 2001
>    > From: Changcheng Liu <changcheng.liu@xxxxxxxxxx>
>    > Date: Tue, 23 Jul 2019 18:50:57 +0800
>    > Subject: [PATCH] rdma systemd: grant access to /dev and unlimit mem
>    >
>    > Signed-off-by: Changcheng Liu <changcheng.liu@xxxxxxxxxx>
>    >
>    > diff --git a/systemd/ceph-fuse@.service.in b/systemd/ceph-fuse@.service.in
>    > index d603042b12..ff2e9072f6 100644
>    > --- a/systemd/ceph-fuse@.service.in
>    > +++ b/systemd/ceph-fuse@.service.in
>    > @@ -12,6 +12,7 @@ ExecStart=/usr/bin/ceph-fuse -f --cluster ${CLUSTER} %I
>    >  LockPersonality=true
>    >  MemoryDenyWriteExecute=true
>    >  NoNewPrivileges=true
>    > +LimitMEMLOCK=infinity
>    >  # ceph-fuse requires access to /dev fuse device
>    >  PrivateDevices=no
>    >  ProtectControlGroups=true
>    > diff --git a/systemd/ceph-mds@.service.in b/systemd/ceph-mds@.service.in
>    > index 39a2e63105..0e58dfeeea 100644
>    > --- a/systemd/ceph-mds@.service.in
>    > +++ b/systemd/ceph-mds@.service.in
>    > @@ -14,7 +14,8 @@ ExecReload=/bin/kill -HUP $MAINPID
>    >  LockPersonality=true
>    >  MemoryDenyWriteExecute=true
>    >  NoNewPrivileges=true
>    > -PrivateDevices=yes
>    > +LimitMEMLOCK=infinity
>    > +PrivateDevices=no
>    >  ProtectControlGroups=true
>    >  ProtectHome=true
>    >  ProtectKernelModules=true
>    > diff --git a/systemd/ceph-mgr@.service.in b/systemd/ceph-mgr@.service.in
>    > index c98f6378b9..682c7ecef3 100644
>    > --- a/systemd/ceph-mgr@.service.in
>    > +++ b/systemd/ceph-mgr@.service.in
>    > @@ -18,7 +18,8 @@ LockPersonality=true
>    >  MemoryDenyWriteExecute=false
>    >
>    >  NoNewPrivileges=true
>    > -PrivateDevices=yes
>    > +LimitMEMLOCK=infinity
>    > +PrivateDevices=no
>    >  ProtectControlGroups=true
>    >  ProtectHome=true
>    >  ProtectKernelModules=true
>    > diff --git a/systemd/ceph-mon@.service.in b/systemd/ceph-mon@.service.in
>    > index c95fcabb26..51854fad96 100644
>    > --- a/systemd/ceph-mon@.service.in
>    > +++ b/systemd/ceph-mon@.service.in
>    > @@ -21,7 +21,8 @@ LockPersonality=true
>    >  MemoryDenyWriteExecute=true
>    >  # Need NewPrivileges via `sudo smartctl`
>    >  NoNewPrivileges=false
>    > -PrivateDevices=yes
>    > +LimitMEMLOCK=infinity
>    > +PrivateDevices=no
>    >  ProtectControlGroups=true
>    >  ProtectHome=true
>    >  ProtectKernelModules=true
>    > diff --git a/systemd/ceph-osd@.service.in b/systemd/ceph-osd@.service.in
>    > index 1b5c9c82b8..06c20d7c83 100644
>    > --- a/systemd/ceph-osd@.service.in
>    > +++ b/systemd/ceph-osd@.service.in
>    > @@ -16,6 +16,8 @@ LockPersonality=true
>    >  MemoryDenyWriteExecute=true
>    >  # Need NewPrivileges via `sudo smartctl`
>    >  NoNewPrivileges=false
>    > +LimitMEMLOCK=infinity
>    > +PrivateDevices=no
>    >  ProtectControlGroups=true
>    >  ProtectHome=true
>    >  ProtectKernelModules=true
>    > diff --git a/systemd/ceph-radosgw@.service.in b/systemd/ceph-radosgw@.service.in
>    > index 7e3ddf6c04..fe1a6b9159 100644
>    > --- a/systemd/ceph-radosgw@.service.in
>    > +++ b/systemd/ceph-radosgw@.service.in
>    > @@ -13,7 +13,8 @@ ExecStart=/usr/bin/radosgw -f --cluster ${CLUSTER} --name client.%i --setuser ce
>    >  LockPersonality=true
>    >  MemoryDenyWriteExecute=true
>    >  NoNewPrivileges=true
>    > -PrivateDevices=yes
>    > +LimitMEMLOCK=infinity
>    > +PrivateDevices=no
>    >  ProtectControlGroups=true
>    >  ProtectHome=true
>    >  ProtectKernelModules=true
>    > diff --git a/systemd/ceph-volume@.service b/systemd/ceph-volume@.service
>    > index c21002cecb..e2d1f67b85 100644
>    > --- a/systemd/ceph-volume@.service
>    > +++ b/systemd/ceph-volume@.service
>    > @@ -9,6 +9,7 @@ KillMode=none
>    >  Environment=CEPH_VOLUME_TIMEOUT=10000
>    >  ExecStart=/bin/sh -c 'timeout $CEPH_VOLUME_TIMEOUT /usr/sbin/ceph-volume-systemd %i'
>    >  TimeoutSec=0
>    > +LimitMEMLOCK=infinity
>    >
>    >  [Install]
>    >  WantedBy=multi-user.target
>    > --
>    > 2.17.1
>    >
> 
> 
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx



