infiniband and redundant mode

Evgeny Barskiy <barskiy@xxxxxx> · Fri, 26 Oct 2012 20:44:41 +0400



    Hello,

    
    Here are some changes to allow corosync to run in iba + rrp mode

    (problem #2, described in
    
    http://lists.corosync.org/pipermail/discuss/2012-October/002086.html):

    
    ------ totemiba.c line 1031 send_token_unbind

    +if(instance->send_token_ah)

    +{

    +       ibv_destroy_ah(instance->send_token_ah);

    +       instance->send_token_ah = 0;

    +}

    
    ------ totemiba.c line 1419 totemiba_token_send

    
    +if(instance->send_token_ah)

            res = ibv_post_send (instance->send_token_cma_id->qp, &send_wr, &failed_send_wr);

    
    It looks like it initializes, joins cpg, and runs normally
    after this small fix.
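
    To make the intent of the two guards explicit, here is a minimal
    standalone sketch. It does not use libibverbs: fake_ah,
    fake_instance, fake_destroy_ah, sketch_unbind and sketch_token_send
    are hypothetical stand-ins, and returning -1 when no address handle
    exists is an assumption, not what totemiba.c does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a verbs address handle (struct ibv_ah). */
struct fake_ah {
	int destroyed;               /* how many times it was destroyed */
};

/* Hypothetical stand-in for the relevant part of totemiba_instance. */
struct fake_instance {
	struct fake_ah *send_token_ah;   /* NULL while the token path is unbound */
	int posts;                       /* number of sends actually posted */
};

/* Stand-in for ibv_destroy_ah(); must never see a NULL handle. */
static int fake_destroy_ah (struct fake_ah *ah)
{
	assert (ah != NULL);
	ah->destroyed += 1;
	return 0;
}

/* Mirrors the guard added to send_token_unbind: destroy at most once. */
static void sketch_unbind (struct fake_instance *instance)
{
	if (instance->send_token_ah) {
		fake_destroy_ah (instance->send_token_ah);
		instance->send_token_ah = NULL;
	}
}

/* Mirrors the guard added to totemiba_token_send: skip the post when
 * there is no address handle. The -1 return value is an assumption. */
static int sketch_token_send (struct fake_instance *instance)
{
	if (instance->send_token_ah == NULL) {
		return -1;
	}
	instance->posts += 1;    /* the real code calls ibv_post_send here */
	return 0;
}
```

    Calling sketch_unbind twice in a row destroys the handle only once,
    and a send attempted afterwards is skipped instead of touching a
    stale handle.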

    
    The other problem

    (problem #1, described in http://lists.corosync.org/pipermail/discuss/2012-October/002086.html):

    is more interesting. Yes, it only occurs during program
    startup, but that is just a side effect.

    
    What if one switch is down when corosync starts?

    If it is the first switch: assert in memb_ring_id_create_or_load

    If it is any other one: infinite loop or even a segfault

    
    In that case this event will never happen:

    
    	case RDMA_CM_EVENT_MULTICAST_JOIN:
		instance->mcast_qpn = event->param.ud.qp_num;
		instance->mcast_qkey = event->param.ud.qkey;
		instance->mcast_ah = ibv_create_ah (instance->mcast_pd, &event->param.ud.ah_attr);

		instance->totemiba_iface_change_fn (instance->rrp_context, &instance->my_id);
		break;
    so main_iface_change_fn will not be called enough times and we will
    never enter the gathering state

    
    totemudp checks whether the interface is down, and even if it is
    down it still calls the main_iface_change function

    
    so I think somewhere here

    
    static void timer_function_netif_check_timeout (
      void *data){
	struct totemiba_instance *instance = (struct totemiba_instance *)data;
	int res;
	int interface_up;
	int interface_num;
	int addr_len;

	totemip_iface_check (&instance->totem_interface->bindnet,
		&instance->totem_interface->boundto, &interface_up, &interface_num,
		instance->totem_config->clear_node_high_bit);


    we should at least check the "interface_up" variable,
    as the udp version does

    we should probably also set up a timer that retries the
    initialization later
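
    Roughly what I have in mind, as a standalone sketch rather than a
    patch. All names here (sketch_iba, sketch_netif_check and its
    fields) are hypothetical, and reporting the "down" state only once
    while retrying is my assumption about what rrp needs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the state the netif check would work with. */
struct sketch_iba {
	int interface_up;        /* result of the interface check */
	int down_reported;       /* iface change already reported while down */
	int iface_change_calls;  /* stands in for totemiba_iface_change_fn calls */
	int binds_started;       /* stands in for starting the rdma_cm bind */
	int retry_armed;         /* stands in for a re-armed check timer */
};

/* One run of the (hypothetical) netif check timer callback. */
static void sketch_netif_check (struct sketch_iba *s)
{
	assert (s != NULL);
	s->retry_armed = 0;

	if (s->interface_up) {
		/* Interface is usable: start binding. The iface change is
		 * then reported from the multicast join event as today. */
		s->binds_started += 1;
		return;
	}

	/* Interface is down: report it anyway so the upper layer is not
	 * left waiting, but only once (assumption). */
	if (!s->down_reported) {
		s->iface_change_calls += 1;
		s->down_reported = 1;
	}

	/* Try again later instead of giving up. */
	s->retry_armed = 1;
}
```

    With the interface down, every run re-arms the timer and the upper
    layer is notified on the first run only; once the interface comes
    up, the next run starts the bind and no retry is armed.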

    
    The next question: is it possible to simply lose the
    RDMA_CM_EVENT_MULTICAST_JOIN event?

    If yes, we get an infinite loop this way, so probably some timer is
    required?
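
    Such a timer could look something like the sketch below. It is a
    standalone model, not corosync code: sketch_join and the
    sketch_join_* functions are hypothetical, time is passed in as
    plain milliseconds, and re-issuing the join on timeout is only one
    possible reaction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one pending multicast join. */
struct sketch_join {
	int joined;             /* set when the join event arrives */
	uint64_t deadline_ms;   /* when we stop waiting for the event */
	int attempts;           /* how many joins were issued */
};

/* Issue a join (the real code would call into rdma_cm here) and
 * remember when to give up waiting for its event. */
static void sketch_join_start (struct sketch_join *j, uint64_t now_ms,
	uint64_t timeout_ms)
{
	assert (j != NULL);
	j->joined = 0;
	j->deadline_ms = now_ms + timeout_ms;
	j->attempts += 1;
}

/* Called when the join event is delivered. */
static void sketch_join_event (struct sketch_join *j)
{
	assert (j != NULL);
	j->joined = 1;
}

/* Called from a periodic timer. Returns 1 if the join was re-issued
 * because the event did not arrive before the deadline, else 0. */
static int sketch_join_poll (struct sketch_join *j, uint64_t now_ms,
	uint64_t timeout_ms)
{
	assert (j != NULL);
	if (!j->joined && now_ms >= j->deadline_ms) {
		sketch_join_start (j, now_ms, timeout_ms);
		return 1;
	}
	return 0;
}
```

    The point is only that waiting for the event gets a deadline, so a
    lost event leads to a retry (or an error) rather than waiting
    forever.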

    
    Evgeny

  