Re: RDMA with mellanox connect x3pro on debian stretch and proxmox v5.0 kernel 4.10.17-3


 



Haomai,

Here is the output of ibstat:
CA 'mlx4_0'
        CA type: MT4103
        Number of ports: 2
        Firmware version: 2.40.7000
        Hardware version: 0
        Node GUID: 0x248a070300e26070
        System image GUID: 0x248a070300e26070
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 56
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x04010000
                Port GUID: 0x268a07fffee26070
                Link layer: Ethernet
        Port 2:
                State: Active
                Physical state: LinkUp
                Rate: 56
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x04010000
                Port GUID: 0x268a07fffee26071
                Link layer: Ethernet


Port 2 is used for Ceph.
Port 1 is the Proxmox cluster interface ...
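
To compare the port numbers ibstat reports with what is set in ceph.conf, the output can be parsed mechanically. The following is only a minimal sketch: the function name parse_ibstat and the trimmed SAMPLE text are made up for illustration, and it assumes the plain "CA '...'" / "Port N:" layout shown above.

```python
import re

def parse_ibstat(text):
    """Return {ca_name: {port_number: {field: value}}} from ibstat text output."""
    cas = {}
    ca = port = None
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"CA '(.+)'$", line)
        if m:
            # New adapter block, e.g. CA 'mlx4_0'
            ca = cas.setdefault(m.group(1), {})
            port = None
            continue
        m = re.match(r"Port (\d+):$", line)
        if m and ca is not None:
            # New port block; "Port GUID: ..." does not match this pattern
            port = ca.setdefault(int(m.group(1)), {})
            continue
        if port is not None and ":" in line:
            key, _, value = line.partition(":")
            port[key.strip()] = value.strip()
    return cas

# Trimmed copy of the output above (hypothetical sample, fewer fields)
SAMPLE = """CA 'mlx4_0'
        Number of ports: 2
        Port 1:
                State: Active
                Physical state: LinkUp
                Link layer: Ethernet
        Port 2:
                State: Active
                Physical state: LinkUp
                Link layer: Ethernet
"""

ports = parse_ibstat(SAMPLE)["mlx4_0"]
for num, info in sorted(ports.items()):
    print(num, info["State"], info["Link layer"])
```

On a live node the same function could be fed the text captured from running ibstat instead of SAMPLE.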

Gerhard W. Recher

net4sec UG (haftungsbeschränkt)
Leitenweg 6
86929 Penzing

+49 171 4802507
On 27.09.2017 at 14:50, Haomai Wang wrote:
> On Wed, Sep 27, 2017 at 8:33 PM, Gerhard W. Recher
> <gerhard.recher@xxxxxxxxxxx> wrote:
>> Hi Folks!
>>
>> I'm totally stuck.
>>
>> RDMA is running on my NICs; rping, udaddy, etc. give positive results.
>>
>> The cluster consists of:
>> proxmox-ve: 5.0-23 (running kernel: 4.10.17-3-pve)
>> pve-manager: 5.0-32 (running version: 5.0-32/2560e073)
>>
>> System (4 nodes): Supermicro 2028U-TN24R4T+
>>
>> 2-port Mellanox ConnectX-3 Pro, 56 Gbit
>> 4-port Intel 10GigE
>> Memory: 768 GBytes
>> CPU: dual Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
>>
>> Ceph: 28 OSDs
>> 24 x Intel NVMe 2000 GB, Intel SSD DC P3520, 2.5", PCIe 3.0 x4
>>  4 x Intel NVMe 1.6 TB, Intel SSD DC P3700, 2.5", U.2 PCIe 3.0
>>
>>
>> Ceph is running on BlueStore. Enabling RDMA within Ceph (version
>> 12.2.0-pve1) leads to this crash:
>>
>>
>> ceph.conf:
>> [global]
>> ms_type=async+rdma
>> ms_cluster_type = async+rdma
>> ms_async_rdma_port_num=2
> I guess it should be 0. What is the output of "ibstat"?
>
>> ms_async_rdma_device_name=mlx4_0
>> ...
>>
>>
>>
>> -- Reboot --
>> Sep 26 18:56:10 pve02 systemd[1]: Started Ceph cluster manager daemon.
>> Sep 26 18:56:10 pve02 systemd[1]: Reached target ceph target allowing to start/stop all ceph-mgr@.service instances at once.
>> Sep 26 18:56:10 pve02 ceph-mgr[2233]: 2017-09-26 18:56:10.427474 7f0e2137e700 -1 Infiniband binding_port  port not found
>> Sep 26 18:56:10 pve02 ceph-mgr[2233]: /home/builder/source/ceph-12.2.0/src/msg/async/rdma/Infiniband.cc: In function 'void Device::binding_port(CephContext*, int)' thread 7f0e2137e700 time 2017-09-26 18:56:10.427498
>> Sep 26 18:56:10 pve02 ceph-mgr[2233]: /home/builder/source/ceph-12.2.0/src/msg/async/rdma/Infiniband.cc: 144: FAILED assert(active_port)
>> Sep 26 18:56:10 pve02 ceph-mgr[2233]:  ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)
>> Sep 26 18:56:10 pve02 ceph-mgr[2233]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55e9dde4bd12]
>> Sep 26 18:56:10 pve02 ceph-mgr[2233]:  2: (Device::binding_port(CephContext*, int)+0x573) [0x55e9de1b2c33]
>> Sep 26 18:56:10 pve02 ceph-mgr[2233]:  3: (Infiniband::init()+0x15f) [0x55e9de1b8f1f]
>> Sep 26 18:56:10 pve02 ceph-mgr[2233]:  4: (RDMAWorker::connect(entity_addr_t const&, SocketOptions const&, ConnectedSocket*)+0x4c) [0x55e9ddf2329c]
>> Sep 26 18:56:10 pve02 ceph-mgr[2233]:  5: (AsyncConnection::_process_connection()+0x446) [0x55e9de1a6d86]
>> Sep 26 18:56:10 pve02 ceph-mgr[2233]:  6: (AsyncConnection::process()+0x7f8) [0x55e9de1ac328]
>> Sep 26 18:56:10 pve02 ceph-mgr[2233]:  7: (EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1125) [0x55e9ddf198a5]
>> Sep 26 18:56:10 pve02 ceph-mgr[2233]:  8: (()+0x4c9288) [0x55e9ddf1d288]
>> Sep 26 18:56:10 pve02 ceph-mgr[2233]:  9: (()+0xb9e6f) [0x7f0e259d4e6f]
>> Sep 26 18:56:10 pve02 ceph-mgr[2233]:  10: (()+0x7494) [0x7f0e260d1494]
>> Sep 26 18:56:10 pve02 ceph-mgr[2233]:  11: (clone()+0x3f) [0x7f0e25149aff]
>> Sep 26 18:56:10 pve02 ceph-mgr[2233]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>>
>> Any advice?
>>
>>
>> --
>> Gerhard W. Recher
>>
>> net4sec UG (haftungsbeschränkt)
>> Leitenweg 6
>> 86929 Penzing
>>
>> +49 171 4802507
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
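
As a cross-check of the quoted ceph.conf against the ibstat port list, a small sketch like the one below can confirm that the configured device and port number appear among the ports ibstat reports. The function check_rdma_port and the hand-typed IBSTAT_PORTS table are hypothetical, and the elided "..." lines of the config are left out. It only checks that the number exists in ibstat's list; it says nothing about how Ceph itself interprets ms_async_rdma_port_num.

```python
import configparser

# Trimmed copy of the quoted [global] options (elided "..." lines omitted)
CEPH_CONF = """
[global]
ms_type=async+rdma
ms_cluster_type = async+rdma
ms_async_rdma_port_num=2
ms_async_rdma_device_name=mlx4_0
"""

# Port states per device as reported by ibstat, typed in by hand (assumption)
IBSTAT_PORTS = {"mlx4_0": {1: "Active", 2: "Active"}}

def check_rdma_port(conf_text, ports_by_device):
    """Return (device, port, state); state is None if the port is not listed."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    section = parser["global"]
    device = section.get("ms_async_rdma_device_name")
    port = section.getint("ms_async_rdma_port_num")
    state = ports_by_device.get(device, {}).get(port)
    return device, port, state

device, port, state = check_rdma_port(CEPH_CONF, IBSTAT_PORTS)
if state is None:
    print(f"{device} port {port}: not listed by ibstat")
else:
    print(f"{device} port {port}: {state}")
```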


