Hi,
thanks for investigating the problem. I'm also trying to find out WHAT the main problem is, but sadly it has been more like "try to find IBA HW" ;) (and sadly neither SoftiWARP nor Soft-RoCE was working).

Evgeny Barskiy wrote:
> Hello,
>
> Here are some changes to allow corosync to run in iba + rrp mode
> (problem #2, described in
> http://lists.corosync.org/pipermail/discuss/2012-October/002086.html):
>
> ------ totemiba.c line 1031 send_token_unbind
> +if(instance->send_token_ah)
> +{
> +    ibv_destroy_ah(instance->send_token_ah);
> +    instance->send_token_ah = 0;
> +}
>
> ------ totemiba.c line 1419 totemiba_token_send
>
> +if(instance->send_token_ah)
>     res = ibv_post_send (instance->send_token_cma_id->qp, &send_wr,
>         &failed_send_wr);
>

I'm unsure if this is really the correct solution (maybe it is), but send_token_ah shouldn't really be NULL.

> It looks like it is initializing, joining cpg and running normally after
> this small fix
>
>
> The other problem
> (problem #1, described in
> http://lists.corosync.org/pipermail/discuss/2012-October/002086.html)
> is kind of more interesting. Yes, it only occurs during program start,
> but this is just a side effect.
>
> What if we have one switch down during corosync start?
> If it is the first switch - assert in memb_ring_id_create_or_load
> If it is any other one - infinite loop or even segfault
>
> this event will never happen:
>
> case RDMA_CM_EVENT_MULTICAST_JOIN:
>     instance->mcast_qpn = event->param.ud.qp_num;
>     instance->mcast_qkey = event->param.ud.qkey;
>     instance->mcast_ah = ibv_create_ah (instance->mcast_pd,
>         &event->param.ud.ah_attr);
>     instance->totemiba_iface_change_fn (instance->rrp_context,
>         &instance->my_id);
>     break;
>
> so main_iface_change_fn will not be called enough times and we will not
> enter the gathering state
>
> in totemudp there is a check whether the interface is down, and even if
> it is down we call the main_iface_change function
>
> so I think somewhere here
>
> static void timer_function_netif_check_timeout (
>     void *data)
> {
>     struct totemiba_instance *instance = (struct totemiba_instance *)data;
>     int res;
>     int interface_up;
>     int interface_num;
>     int addr_len;
>     totemip_iface_check (&instance->totem_interface->bindnet,
>         &instance->totem_interface->boundto, &interface_up, &interface_num,
>         instance->totem_config->clear_node_high_bit);
>
> we should at least check the "interface_up" variable, like in the udp version
> also we should probably set up a timer which will retry to initialize it later
>

exactly

> Next question is whether it is possible to simply lose the
> RDMA_CM_EVENT_MULTICAST_JOIN event?
> If yes, we will have an infinite loop this way; probably some timer is
> required?
>

I'm really unsure there.

> Evgeny
>
>

Thanks for your investigation. Hopefully we will be able to find a proper solution soon.

Regards,
  Honza

> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss
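
For reference, the idea discussed above (report the interface change even when the interface is down, and keep a timer that retries the multicast join) could be sketched roughly as follows. This is only a standalone illustration, not corosync code: every name starting with sketch_ is hypothetical, the join is simulated by a flag instead of the real rdma_cm calls, and whether the callback should fire once or on every state change is an assumption that would need checking against totemudp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-interface state; not a corosync struct. */
struct sketch_iface {
    bool link_up;          /* what the interface check would report as interface_up */
    bool joined;           /* true once the (simulated) multicast join completed */
    bool change_reported;  /* iface-change callback already delivered upward */
    unsigned retries;      /* join attempts made so far */
    unsigned changes;      /* how many times the callback fired */
};

/* Stand-in for calling instance->totemiba_iface_change_fn(). */
static void sketch_report_iface_change(struct sketch_iface *ifc)
{
    ifc->changes++;
    ifc->change_reported = true;
}

/* Stand-in for the join attempt plus its completion event.
 * Assumption: the join succeeds exactly when the link is up. */
static bool sketch_try_join(struct sketch_iface *ifc)
{
    ifc->retries++;
    ifc->joined = ifc->link_up;
    return ifc->joined;
}

/* One tick of the interface check timer.
 * Returns true when the caller should re-arm the timer and try again. */
static bool sketch_netif_check_tick(struct sketch_iface *ifc)
{
    assert(ifc != NULL);

    if (ifc->joined) {
        return false;               /* nothing left to do */
    }

    bool ok = sketch_try_join(ifc);

    /* Report the interface once even if it is down, so the upper layer
     * sees a callback for every configured interface and does not wait
     * forever for a join event that never arrives. */
    if (!ifc->change_reported) {
        sketch_report_iface_change(ifc);
    }

    return !ok;                     /* still down: retry on the next tick */
}
```

In this sketch the caller would invoke sketch_netif_check_tick() from the timer handler and re-arm the timer whenever it returns true; the same retry path would also cover a lost join event, since the join is simply attempted again.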