opensm + ceph RDMA = Down OSDs

"westjoshuaalan@xxxxxxxxx" <westjoshuaalan@xxxxxxxxx> · Tue, 16 Mar 2021 06:15:19 -0600

Hello,

I understand that this mailing list may be able to help me with an
issue I am experiencing specific to opensm/infiniband rdma!

Forgive me if I am not in the right place; if so, I would
appreciate a pointer to the next step in my journey!

I am having a recurring issue where after some period of working
without issue, I come to find several OSDs in my ceph cluster offline,
each reporting the same clock skew error in their systemd unit log
(`systemctl status ceph-osd@##`):
>
> Mar 12 00:39:28 rd240 ceph-osd[1655164]: 2021-03-12T00:39:28.006-0700 7f0d635b2700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2021-03-11T23:39:28.012146-0700)
> Mar 12 00:39:29 rd240 ceph-osd[1655164]: 2021-03-12T00:39:29.006-0700 7f0d635b2700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2021-03-11T23:39:29.012281-0700)
> Mar 12 00:39:30 rd240 ceph-osd[1655164]: 2021-03-12T00:39:30.007-0700 7f0d635b2700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2021-03-11T23:39:30.012420-0700)
>
>  (Ignore the fact that it's right at midnight, which I believe is simply from log rotation.)
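
For reference, a minimal sketch of how the gap inside one of these messages can be measured: it compares the time the line was logged with the "before" cutoff printed in the same line. The function name `skew_gap` is made up for illustration, and the sketch assumes the timestamp layout seen in the lines quoted above. It only measures the gap; it does not say what causes it.

```python
from datetime import datetime

# One of the messages quoted above, trimmed to the part written by ceph-osd.
line = ("2021-03-12T00:39:28.006-0700 7f0d635b2700 -1 monclient: "
        "_check_auth_rotating possible clock skew, rotating keys expired "
        "way too early (before 2021-03-11T23:39:28.012146-0700)")

# Layout of both timestamps in the message (fractional seconds + UTC offset).
FMT = "%Y-%m-%dT%H:%M:%S.%f%z"

def skew_gap(log_line):
    """Return (time the line was logged) minus (the 'before' cutoff)."""
    logged_text = log_line.split()[0]
    cutoff_text = log_line.split("(before ")[1].rstrip(")")
    logged = datetime.strptime(logged_text, FMT)
    cutoff = datetime.strptime(cutoff_text, FMT)
    return logged - cutoff

gap = skew_gap(line)
print(gap, gap.total_seconds())
```

In this sample the cutoff sits just under one hour before the time the line was logged, and the same holds for the other two lines.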

When I restart the OSDs, they all come back online without issue, but
I expect them to go down again at some point in the future.

See attached the two opensm logs from the node on which all osd
failures occurred today.

Here is the only message from the opensm master (I have three nodes in
total, with opensm running on each, and a different opensm priority for
each port in an attempt to create redundancy):

> Mar 11 19:32:26 717944 [EC430700] 0x01 -> log_send_error: ERR 5411: DR SMP Send completed with error (IB_TIMEOUT) -- dropping
>                         Method 0x1, Attr 0x20, TID 0x11f51
> Mar 11 19:32:26 718212 [EC430700] 0x01 -> Received SMP on a 2 hop path: Initial path = 0,1,3, Return path  = 0,0,0
> Mar 11 19:32:26 718223 [EC430700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(SMInfo), attr_mod 0x0, TID 0x11f51
> Mar 11 19:32:26 718227 [EC430700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3120: Timeout while getting attribute 0x20 (SMInfo); Possible mis-set mkey?
> Mar 11 19:34:56 733934 [EC430700] 0x01 -> log_send_error: ERR 5411: DR SMP Send completed with error (IB_TIMEOUT) -- dropping
>                         Method 0x1, Attr 0x20, TID 0x11f8d
> Mar 11 19:34:56 733965 [EC430700] 0x01 -> Received SMP on a 2 hop path: Initial path = 0,1,3, Return path  = 0,0,0
> Mar 11 19:34:56 733972 [EC430700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(SMInfo), attr_mod 0x0, TID 0x11f8d
> Mar 11 19:34:56 733977 [EC430700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3120: Timeout while getting attribute 0x20 (SMInfo); Possible mis-set mkey?
> Mar 11 19:34:56 733993 [EC430700] 0x01 -> log_send_error: ERR 5411: DR SMP Send completed with error (IB_TIMEOUT) -- dropping
>                         Method 0x1, Attr 0x20, TID 0x11f8e
> Mar 11 19:34:56 733999 [EC430700] 0x01 -> Received SMP on a 2 hop path: Initial path = 0,1,2, Return path  = 0,0,0
> Mar 11 19:34:56 734003 [EC430700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(SMInfo), attr_mod 0x0, TID 0x11f8e
> Mar 11 19:34:56 734007 [EC430700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3120: Timeout while getting attribute 0x20 (SMInfo); Possible mis-set mkey?
> Mar 11 19:35:26 725924 [EC430700] 0x01 -> log_send_error: ERR 5411: DR SMP Send completed with error (IB_TIMEOUT) -- dropping
>                         Method 0x1, Attr 0x20, TID 0x11f99

--> That "Return path = 0,0,0" strikes me as possibly incorrect?
--> This just repeats until the failed ceph OSDs are restarted
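
To see how often each error and each path shows up, the log can be tallied with a short script. This is a rough sketch that assumes the line layout of the excerpt above; `summarize` is a hypothetical helper name, and the sample is a handful of lines copied from the excerpt rather than the full log.

```python
import re
from collections import Counter

# A few lines copied from the opensm excerpt above.
sample = """\
Mar 11 19:32:26 717944 [EC430700] 0x01 -> log_send_error: ERR 5411: DR SMP Send completed with error (IB_TIMEOUT) -- dropping
Mar 11 19:32:26 718212 [EC430700] 0x01 -> Received SMP on a 2 hop path: Initial path = 0,1,3, Return path  = 0,0,0
Mar 11 19:32:26 718223 [EC430700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(SMInfo), attr_mod 0x0, TID 0x11f51
Mar 11 19:32:26 718227 [EC430700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3120: Timeout while getting attribute 0x20 (SMInfo); Possible mis-set mkey?
Mar 11 19:34:56 733993 [EC430700] 0x01 -> log_send_error: ERR 5411: DR SMP Send completed with error (IB_TIMEOUT) -- dropping
Mar 11 19:34:56 733999 [EC430700] 0x01 -> Received SMP on a 2 hop path: Initial path = 0,1,2, Return path  = 0,0,0
"""

ERR_CODE = re.compile(r"ERR (\w+):")
INITIAL_PATH = re.compile(r"Initial path = ([\d,]+), Return path")

def summarize(text):
    """Count error codes and directed-route initial paths in a log excerpt."""
    errors, paths = Counter(), Counter()
    for line in text.splitlines():
        code = ERR_CODE.search(line)
        if code:
            errors[code.group(1)] += 1
        path = INITIAL_PATH.search(line)
        if path:
            paths[path.group(1)] += 1
    return errors, paths

errors, paths = summarize(sample)
print(dict(errors))
print(dict(paths))
```

Run against the whole log file (the one given to opensm with `-f`), this would show whether the timeouts are confined to the paths 0,1,3 and 0,1,2 or spread more widely.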

Regarding the "Possible mis-set mkey?" line: I believe the mkey is still
at its default, since as far as I remember I have never set it
explicitly, but I haven't figured out how to check it either.
I suspect the error is in my opensm config.
Each opensm instance is run via:
/etc/init.d/opensm:
>
> ...
> start-stop-daemon --start --quiet --make-pidfile --pidfile /var/run/opensm-0x7cfe900300179b31 --background --exec \
> /usr/sbin/opensm -- \
> --daemon \
> -g 0x7cfe900300179b31 \
> -p /etc/opensm/partitions.conf \
> -f /var/log/opensm.0x7cfe900300179b31.log \
> --priority 11 \ #changed per port, no overlapping priorities
> --port-shifting \
> --ucast_cache \
> --do_mesh_analysis \
> --lmc 0 \
> -R ftree,updn,minhop \
> --part_enforce both \
> --allow_both_pkeys
> ...

Digging into this error, I checked the link error counters and found the
following, mainly (I believe) on the IB switch itself:

> # ibqueryerrors
> Errors for "dl380g7 mlx4_0"
>    GUID 0x7cfe900300179b31 port 1: [VL15Dropped == 14] [PortXmitWait == 4294967295]
>    GUID 0x7cfe900300179b32 port 2: [PortXmitWait == 2]
> Errors for "server mlx4_0"
>    GUID 0x2c9030042dff1 port 1: [PortXmitWait == 4294967295]
> Errors for 0x2c903006e29b0 "MF0;ys23ib4:SX6036/U1"
>    GUID 0x2c903006e29b0 port ALL: [LinkErrorRecoveryCounter == 15] [LinkDownedCounter == 48] [PortRcvSwitchRelayErrors == 2461] [PortXmitDiscards == 15854] [VL15Dropped == 5] [PortXmitWait == 4294967295]
>    GUID 0x2c903006e29b0 port 0: [PortXmitWait == 76959324]
>    GUID 0x2c903006e29b0 port 1: [LinkDownedCounter == 8] [PortRcvSwitchRelayErrors == 2] [PortXmitWait == 4294967295]
>    GUID 0x2c903006e29b0 port 2: [LinkDownedCounter == 5] [PortRcvSwitchRelayErrors == 4] [VL15Dropped == 2] [PortXmitWait == 4294967295]
>    GUID 0x2c903006e29b0 port 3: [LinkDownedCounter == 5] [PortRcvSwitchRelayErrors == 2] [VL15Dropped == 3] [PortXmitWait == 4294967295]
>    GUID 0x2c903006e29b0 port 4: [LinkDownedCounter == 8] [PortRcvSwitchRelayErrors == 2] [PortXmitWait == 2]
>    GUID 0x2c903006e29b0 port 5: [SymbolErrorCounter == 65535] [LinkErrorRecoveryCounter == 8] [LinkDownedCounter == 2] [PortRcvSwitchRelayErrors == 149] [PortXmitDiscards == 649] [PortXmitWait == 84704351]
>    GUID 0x2c903006e29b0 port 6: [LinkDownedCounter == 9] [PortRcvSwitchRelayErrors == 2285] [PortXmitDiscards == 14176] [PortXmitWait == 4294967295]
>    GUID 0x2c903006e29b0 port 7: [SymbolErrorCounter == 65535] [LinkErrorRecoveryCounter == 7] [LinkDownedCounter == 2] [PortRcvSwitchRelayErrors == 1]
>    GUID 0x2c903006e29b0 port 8: [LinkDownedCounter == 9] [PortRcvSwitchRelayErrors == 16] [PortXmitDiscards == 1029]
> Errors for "rd240 mlx4_0"
>    GUID 0xe41d2d0300e0bae1 port 1: [PortXmitWait == 4294967295]
>    GUID 0xe41d2d0300e0bae2 port 2: [PortXmitWait == 1]
>
> ## Summary: 4 nodes checked, 4 bad nodes found
> ##          43 ports checked, 14 ports have errors beyond threshold
> ## Thresholds:
> ## Suppressed:

I do not have an opensm.conf, instead setting the config via the above
init.d script, but I do have partitions.conf as:
>
> Default=0x7fff, rate=7, mtu=4, scope=2, defmember=full:
>         ALL=full, SELF=full, ALL_SWITCHES=full;
> Default=0x7fff, ipoib, rate=7, mtu=4, scope=2:
>         mgid=ff12:401b:ffff::ffff:ffff  # JW from error log
>         mgid=ff12:401b::ffff:ffff       # IPv4 Broadcast address
>         mgid=ff12:401b::1               # IPv4 All Hosts group
>         mgid=ff12:401b::2               # IPv4 All Routers group
>         mgid=ff12:401b::16              # IPv4 IGMP group
>         mgid=ff12:401b::fb              # IPv4 mDNS group
>         mgid=ff12:401b::fc              # IPv4 Multicast Link Local Name Resolution group
>         mgid=ff12:401b::101             # IPv4 NTP group
>         mgid=ff12:401b::202             # IPv4 Sun RPC
>         mgid=ff12:601b::1               # IPv6 All Hosts group
>         mgid=ff12:601b::2               # IPv6 All Routers group
>         mgid=ff12:601b::16              # IPv6 MLDv2-capable Routers group
>         mgid=ff12:601b::fb              # IPv6 mDNS group
>         mgid=ff12:601b::101             # IPv6 NTP group
>         mgid=ff12:601b::202             # IPv6 Sun RPC group
>         mgid=ff12:601b::1:3             # IPv6 Multicast Link Local Name Resolution group
>         ALL=full, SELF=full, ALL_SWITCHES=full;
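
For anyone reading the numeric fields in that file, here is a small sketch of how I understand two of them: the top bit of a 16-bit P_Key marks full membership, and `mtu` is a code rather than a byte count. The table and the helper name `describe` are my own illustration and an assumption about the encoding, not something taken from the opensm sources.

```python
# InfiniBand MTU codes as I understand the encoding (assumption);
# my partitions.conf uses mtu=4.
MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}

# The top bit of the 16-bit P_Key marks full (vs. limited) membership.
FULL_MEMBER_BIT = 0x8000

def describe(pkey_base, mtu_code, full=True):
    """Show the on-wire P_Key and MTU in bytes for one partition entry."""
    if full:
        pkey = pkey_base | FULL_MEMBER_BIT
    else:
        pkey = pkey_base & 0x7FFF
    return {"pkey": hex(pkey), "mtu_bytes": MTU_BYTES[mtu_code]}

print(describe(0x7FFF, 4))              # full member of the default partition
print(describe(0x7FFF, 4, full=False))  # limited member of the same partition
```

Under that reading, `Default=0x7fff` with `ALL=full` corresponds to P_Key 0xffff on the ports, with a 2048-byte MTU.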

I am hopeful that this group may be able to point me in the right
direction. I am 100% self-taught and have come a long way, but this
issue has been haunting me for a while now.
It's not a production cluster; I am just learning and playing with
stuff I find really interesting, but the double-edged sword is that
now I am feeling pretty lost...

Any help would be greatly appreciated!
Attachment: opensm.1.log.gz (application/gzip)
Attachment: opensm.2.log.gz (application/gzip)