Thanks for the review. On 08/04/24, Mark Zhang wrote: > On 4/3/2024 3:36 AM, Etienne AUJAMES wrote: > > External email: Use caution opening links or attachments > > > > > > Define new options in 'rdma_set_option' to override default CM retries > > ("Max CM retries") and timeouts ("Local CM Response Timeout" and "Remote > > CM Response Timeout"). > > > > Sometimes user-facting tunable is not preferred but let's see: > > https://lore.kernel.org/linux-rdma/EC1346C3-3888-4FFB-B302-1DB200AAE00D@xxxxxxxxxx/ Note here that we hit this timeout issue with RoCE on a routed network. This means that no ARP is sent directly to the remote host and rdma_resolve_addr() always exit without error. As mention in the ticket, this is still possible on a flat network. If a node become unresponsive, the application tries to reconnect to the node, remote IP is still in the ARP cache, rdma_resolve_addr() exits without error. > > > These options can be useful for RoCE networks (no SM) to decrease the > > overall connection timeout with an unreachable node (by default, it can > > take several minutes). > > > > This patch is not only for RoCE, right? Yes, you are right. But the connection timeout issue is seen only on RoCE for an unreachable node. With an Infiniband network, the "Subnet Manager" will return an empty "path record" and rdma_resolve_route() will return an error before calling rdma_connect(). So, the purpose of this patch is to mitigate this RoCE connection timeout issue. > > > Signed-off-by: Etienne AUJAMES <eaujames@xxxxxxx> > > --- > > drivers/infiniband/core/cma.c | 92 ++++++++++++++++++++++++++++-- > > drivers/infiniband/core/cma_priv.h | 4 ++ > > drivers/infiniband/core/ucma.c | 14 +++++ > > include/rdma/rdma_cm.h | 5 ++ > > include/uapi/rdma/rdma_user_cm.h | 4 +- > > 5 files changed, 113 insertions(+), 6 deletions(-) > > > > diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c > > index 1e2cd7c8716e..cc73b9708862 100644 > > --- a/drivers/infiniband/core/cma.c > > +++ b/drivers/infiniband/core/cma.c > > @@ -1002,6 +1002,8 @@ __rdma_create_id(struct net *net, rdma_cm_event_handler event_handler, > > id_priv->tos_set = false; > > id_priv->timeout_set = false; > > id_priv->min_rnr_timer_set = false; > > + id_priv->max_cm_retries = false; > > + id_priv->cm_timeout = false; > > id_priv->gid_type = IB_GID_TYPE_IB; > > spin_lock_init(&id_priv->lock); > > mutex_init(&id_priv->qp_mutex); > > @@ -2845,6 +2847,80 @@ int rdma_set_min_rnr_timer(struct rdma_cm_id *id, u8 min_rnr_timer) > > } > > EXPORT_SYMBOL(rdma_set_min_rnr_timer); > > > > +/** > > + * rdma_set_cm_retries() - Set the maximum of CM retries of the QP associated > > + * with a connection identifier. > > This comment (and the one below) seems not accuruate, as it's not for the > QP. This is different from the rdma_set_ack_timeout(). What do you suggest? CM parameters are not used when the QP is connected, but to establish the connection. For me, those parameters are still associated with the connection. What about: /** * rdma_set_cm_retries() - Configure CM retries to establish an active * connection. * @id: Connection identifier to connect. > > > + * @id: Communication identifier associated with service type. > > + * @max_cm_retries: 4-bit value definied as "Max CM Retries" REQ field > > typo: definied -> defined Ack > > > + * (Table 99 "REQ Message Contents" in the IBTA specification). > > + * > > + * This function should be called before rdma_connect() on active side. > > + * The value will determine the maximum number of times the CM should > > + * resend a connection request or reply (REQ/REP) before giving up (UNREACHABLE > > + * event). > > + * > > + * Return: 0 for success > > + */ > > +int rdma_set_cm_retries(struct rdma_cm_id *id, u8 max_cm_retries) > > +{ > > + struct rdma_id_private *id_priv; > > + > > + /* It is a 4-bit value */ > > + if (max_cm_retries & 0xf0) > > + return -EINVAL; > > + > > + if (WARN_ON(id->qp_type != IB_QPT_RC && id->qp_type != IB_QPT_XRC_TGT)) > > + return -EINVAL; > > + > Maybe we don't need a warning here. > I think UD also need these 2 parameters, as UD also has Resp. see > cma_resolve_ib_udp() below. Yes, this seems right. UD used CM timeout to compute req.timeout_ms and CM retries for req.max_cm_retries. But unlike RC, those values are not sent on the wire. > > > + id_priv = container_of(id, struct rdma_id_private, id); > > + mutex_lock(&id_priv->qp_mutex); > > + id_priv->max_cm_retries = max_cm_retries; > > + id_priv->max_cm_retries_set = true; > > + mutex_unlock(&id_priv->qp_mutex); > > + > > + return 0; > > +} > > +EXPORT_SYMBOL(rdma_set_cm_retries); > > + > > +/** > > + * rdma_set_cm_timeout() - Set the CM timeouts of the QP associated with a > > + * connection identifier. > > + * @id: Communication identifier associated with service type. > > + * @cm_timeout: 5-bit value, expressed as 4.096 * 2^(timeout) usec. > > + * This value should be superior than 0. > > + * > > + * This function should be called before rdma_connect() on active side. > > + * The value will affect the "Remote CM Response Timeout" and the > > + * "Local CM Response Timeout" timeouts to respond to a connection request (REQ) > > + * and to wait for connection reply (REP) ack on the remote node. > > + * > > + * Round-trip timeouts for a REQ and REP requests: > > + * REQ_timeout_ms = remote_cm_response_timeout_ms + 2* PacketLifeTime_ms > > + * REP_timeout_ms = local_cm_response_timeout_ms > > + * > > + * Return: 0 for success > > + */ > > +int rdma_set_cm_timeout(struct rdma_cm_id *id, u8 cm_timeout) > > +{ > > + struct rdma_id_private *id_priv; > > + > > + /* It is a 5-bit value */ > > + if (!cm_timeout || (cm_timeout & 0xe0)) > > + return -EINVAL; > > + > > + if (WARN_ON(id->qp_type != IB_QPT_RC && id->qp_type != IB_QPT_XRC_TGT)) > > + return -EINVAL; > > likewise Ack > > + > > + id_priv = container_of(id, struct rdma_id_private, id); > > + mutex_lock(&id_priv->qp_mutex); > > + id_priv->cm_timeout = cm_timeout; > > + id_priv->cm_timeout_set = true; > > + mutex_unlock(&id_priv->qp_mutex); > > + > > + return 0; > > +} > > +EXPORT_SYMBOL(rdma_set_cm_timeout); > > + > > static int route_set_path_rec_inbound(struct cma_work *work, > > struct sa_path_rec *path_rec) > > { > > @@ -4295,8 +4371,11 @@ static int cma_resolve_ib_udp(struct rdma_id_private *id_priv, > > req.path = id_priv->id.route.path_rec; > > req.sgid_attr = id_priv->id.route.addr.dev_addr.sgid_attr; > > req.service_id = rdma_get_service_id(&id_priv->id, cma_dst_addr(id_priv)); > > - req.timeout_ms = 1 << (CMA_CM_RESPONSE_TIMEOUT - 8); > > - req.max_cm_retries = CMA_MAX_CM_RETRIES; > > + req.timeout_ms = id_priv->cm_timeout_set ? > > + id_priv->cm_timeout : CMA_CM_RESPONSE_TIMEOUT; > > + req.timeout_ms = 1 << (req.timeout_ms - 8); > > So req.timeout_ms must be greater than 8? Maybe we need to check in > rdma_set_cm_timeout(). Yes. For RC the active connection timeout is given by: static inline int cm_convert_to_ms(int iba_time) { /* approximate conversion to ms from 4.096us x 2^iba_time */ return 1 << max(iba_time - 8, 0); } int ib_send_cm_req(struct ib_cm_id *cm_id, .... cm_id_priv->timeout_ms = cm_convert_to_ms( param->primary_path->packet_life_time) * 2 + cm_convert_to_ms( param->remote_cm_response_timeout); So, we can check the timeout value in rdma_set_cm_timeout() or modify cma_resolve_ib_udp() to match cm_convert_to_ms() implementation: - req.timeout_ms = 1 << (req.timeout_ms - 8); + req.timeout_ms = 1 << max(req.timeout_ms - 8, 0); Etienne