InfiniBand and redundant mode

Hello,

Here are some changes that allow corosync to run in iba + rrp mode
(problem #2, described in http://lists.corosync.org/pipermail/discuss/2012-October/002086.html):

------ totemiba.c line 1031 send_token_unbind
+if(instance->send_token_ah)
+{
+       ibv_destroy_ah(instance->send_token_ah);
+       instance->send_token_ah = 0;
+}

------ totemiba.c line 1419 totemiba_token_send

+if(instance->send_token_ah)
        res = ibv_post_send (instance->send_token_cma_id->qp, &send_wr, &failed_send_wr);



With this small fix, corosync appears to initialize, join cpg, and run normally.


The other problem
(problem #1, described in http://lists.corosync.org/pipermail/discuss/2012-October/002086.html)
is more interesting. Yes, it only occurs during program startup, but that is just a side effect.

What if one switch is down when corosync starts?
If it is the first switch: assert in memb_ring_id_create_or_load
If it is any other one: infinite loop or even a segfault

In that case this event will never happen:
case RDMA_CM_EVENT_MULTICAST_JOIN:
instance->mcast_qpn = event->param.ud.qp_num;
instance->mcast_qkey = event->param.ud.qkey;
instance->mcast_ah = ibv_create_ah (instance->mcast_pd, &event->param.ud.ah_attr);
instance->totemiba_iface_change_fn (instance->rrp_context, &instance->my_id);
break;
so main_iface_change_fn will not be called enough times and we will never enter the gathering state.

totemudp checks whether the interface is down, and even if it is down it still calls the main_iface_change function.

So I think somewhere here

static void timer_function_netif_check_timeout (
      void *data)
{
struct totemiba_instance *instance = (struct totemiba_instance *)data;
int res;
int interface_up;
int interface_num;
int addr_len;
totemip_iface_check (&instance->totem_interface->bindnet,
&instance->totem_interface->boundto, &interface_up, &interface_num, instance->totem_config->clear_node_high_bit);
we should at least check the "interface_up" variable, as the udp version does.
We should probably also set up a timer that retries the initialization later.
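
To make the idea concrete, here is a rough sketch of the control flow I have in mind. It is not a patch against totemiba.c: every name starting with sketch_ is hypothetical, the struct is a stand-in for the real instance, and the counters only stand in for "call the iface change callback" and "re-arm the netif check timer". The assumption is that a down interface is reported upward once and then polled until it comes up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the transport instance (NOT struct totemiba_instance). */
struct sketch_instance {
    bool bound;             /* true once the interface was found up and bound */
    bool down_reported;     /* true once a down interface was reported upward */
    int iface_change_calls; /* stands in for calls to the iface change callback */
    int retries_scheduled;  /* stands in for re-arming the netif check timer */
};

/* Hypothetical: where real code would invoke the iface change callback. */
static void sketch_notify_iface_change(struct sketch_instance *inst)
{
    inst->iface_change_calls++;
}

/* Hypothetical: where real code would re-arm the check timer. */
static void sketch_schedule_retry(struct sketch_instance *inst)
{
    inst->retries_scheduled++;
}

/* Body of the periodic check; interface_up is the value reported by the
 * interface lookup (nonzero means the interface is up). */
static void sketch_netif_check(struct sketch_instance *inst, int interface_up)
{
    assert(inst != NULL);

    if (inst->bound) {
        return; /* already initialized, nothing to do */
    }
    if (!interface_up) {
        /* Report the down interface once so the upper layer is not left
         * waiting for a callback that never arrives, then try again later. */
        if (!inst->down_reported) {
            inst->down_reported = true;
            sketch_notify_iface_change(inst);
        }
        sketch_schedule_retry(inst);
        return;
    }
    /* Interface is up: real code would bind and join here. */
    inst->bound = true;
    sketch_notify_iface_change(inst);
}
```

Calling sketch_netif_check twice with interface_up == 0 and then once with 1 gives one "down" notification, two scheduled retries, and a second notification when the bind finally succeeds.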

The next question: is it possible to simply lose the RDMA_CM_EVENT_MULTICAST_JOIN event?
If so, we end up in an infinite loop this way. Probably some timer is required?
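
One possible shape for such a timer, as a sketch only: track a deadline for the pending join, re-issue the join a bounded number of times, and give up with an explicit failure state instead of waiting forever. All sketch_ names, the millisecond clock, and the retry limit are my assumptions; in real code the retry branch is where the join request (for example rdma_join_multicast) would be issued again.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_join_state { SKETCH_JOIN_PENDING, SKETCH_JOIN_DONE, SKETCH_JOIN_FAILED };

/* Hypothetical bookkeeping for one outstanding multicast join. */
struct sketch_join {
    enum sketch_join_state state;
    unsigned deadline_ms;  /* time at which the current attempt expires */
    unsigned timeout_ms;   /* how long to wait for each attempt */
    int attempts_left;     /* retries still allowed after the current one */
    int join_requests;     /* stands in for issued join requests */
};

/* Start the first join attempt at time now_ms. */
static void sketch_join_start(struct sketch_join *j, unsigned now_ms,
                              unsigned timeout_ms, int max_attempts)
{
    assert(j != NULL && max_attempts >= 1);
    j->state = SKETCH_JOIN_PENDING;
    j->timeout_ms = timeout_ms;
    j->deadline_ms = now_ms + timeout_ms;
    j->attempts_left = max_attempts - 1;
    j->join_requests = 1;
}

/* Called when the join event actually arrives. */
static void sketch_join_event(struct sketch_join *j)
{
    if (j->state == SKETCH_JOIN_PENDING) {
        j->state = SKETCH_JOIN_DONE;
    }
}

/* Called from a periodic timer with the current time. */
static void sketch_join_tick(struct sketch_join *j, unsigned now_ms)
{
    if (j->state != SKETCH_JOIN_PENDING || now_ms < j->deadline_ms) {
        return; /* finished already, or still within the deadline */
    }
    if (j->attempts_left > 0) {
        /* Deadline passed: re-issue the join and wait again. */
        j->attempts_left--;
        j->join_requests++;
        j->deadline_ms = now_ms + j->timeout_ms;
    } else {
        /* Out of retries: report failure instead of looping forever. */
        j->state = SKETCH_JOIN_FAILED;
    }
}
```

With a 100 ms timeout and two attempts, a tick at 50 ms does nothing, a tick at 100 ms re-issues the join, and a tick at 200 ms marks the join as failed so the caller can treat the interface as faulty.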

Evgeny