Hello there,
Recently I investigated some issues described by Vladimir Voznesensky here:
http://lists.corosync.org/pipermail/discuss/2012-September/002008.html
The first one:
# corosync -f
notice [MAIN ] Corosync Cluster Engine ('2.0.1'): started and ready to
provide service.
info [MAIN ] Corosync built-in features: testagents rdma monitoring
Sep 20 11:28:50 notice [TOTEM ] Initializing transport (Infiniband/IP).
Sep 20 11:28:50 notice [TOTEM ] Initializing transport (Infiniband/IP).
corosync: totemsrp.c:3236: memb_ring_id_create_or_load: Assertion
`!totemip_zero_check(&memb_ring_id->rep)' failed.
This happens with probability 1/2, whenever main_iface_change_fn
(totemsrp.c) is first called for the second interface:
memb_ring_id_create_or_load uses the first interface's IP address to
form a filename, and that address is not initialized yet. So when
RDMA_CM_EVENT_MULTICAST_JOIN in mcast_rdma_event_fn (totemiba.c) fires
first for the second ring, we hit this assert.
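To make the ordering problem concrete, here is a small self-contained model of it, not a patch against corosync. All model_* names are made up for illustration; they only play the roles of main_iface_change_fn, memb_ring_id_create_or_load and totemip_zero_check. The sketch assumes that one possible fix is to defer the ring id load until the first interface has reported its address, whichever interface comes up first.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_IFACE_COUNT 2   /* hypothetical: two redundant rings */
#define MODEL_ADDR_LEN    16  /* hypothetical address size */

/* Hypothetical stand-in for the totemsrp instance; not the real struct. */
struct model_instance {
    unsigned char iface_addr[MODEL_IFACE_COUNT][MODEL_ADDR_LEN];
    bool ring_id_loaded;
};

/* Plays the role of totemip_zero_check: true while the address is all zeros. */
static bool model_addr_is_zero(const unsigned char *addr)
{
    static const unsigned char zero[MODEL_ADDR_LEN];
    return memcmp(addr, zero, MODEL_ADDR_LEN) == 0;
}

/* Plays the role of memb_ring_id_create_or_load: it needs interface 0's
 * address, so it asserts that the address has already been filled in. */
static void model_ring_id_create_or_load(struct model_instance *inst)
{
    assert(!model_addr_is_zero(inst->iface_addr[0]));
    inst->ring_id_loaded = true;
}

/* Plays the role of main_iface_change_fn: called once per interface, in
 * whatever order the multicast join events happen to arrive. */
static void model_iface_change(struct model_instance *inst, int iface_no,
                               const unsigned char *addr)
{
    memcpy(inst->iface_addr[iface_no], addr, MODEL_ADDR_LEN);

    /* Guard: load the ring id only once interface 0 is known. Without
     * this check, a first call for interface 1 trips the assert above. */
    if (!inst->ring_id_loaded && !model_addr_is_zero(inst->iface_addr[0])) {
        model_ring_id_create_or_load(inst);
    }
}
```

With this guard, calling model_iface_change for interface 1 first leaves ring_id_loaded false, and the later call for interface 0 performs the load. Whether deferring is acceptable in the real totemsrp state machine is something I have not verified.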
The second one:
...
Oct 15 12:46:28 debug [TOTEM ] totemsrp.c:1979 entering GATHER state
from 15.
Oct 15 12:46:28 debug [TOTEM ] totemsrp.c:3021 Creating commit token
because I am the rep.
Oct 15 12:46:28 debug [TOTEM ] totemsrp.c:1471 Saving state aru 0 high
seq received 0
Oct 15 12:46:28 debug [TOTEM ] totemsrp.c:3276 Storing new sequence id
for ring 33e0
Oct 15 12:46:28 debug [TOTEM ] totemsrp.c:2035 entering COMMIT state.
and then a segmentation fault.
This one is more serious, since it happens every time, and even if we
get past it once it will catch us later during synchronization with
other processes.
The segmentation fault occurs in the InfiniBand library when we try to
post a message with an uninitialized send_token_ah.
Let's look at the send_token_rdma_event_fn (totemiba.c) code:
...
case RDMA_CM_EVENT_ESTABLISHED:
    instance->send_token_qpn = event->param.ud.qp_num;
    instance->send_token_qkey = event->param.ud.qkey;
    instance->send_token_ah = ibv_create_ah (instance->send_token_pd,
        &event->param.ud.ah_attr);
    instance->totemiba_target_set_completed (instance->rrp_context);
    break;
...
Here we initialize send_token_ah and then call
totemiba_target_set_completed, which calls totemiba_token_send and,
inside it, ibv_post_send with send_token_ah as a parameter. That looks
fine, but when we have two interfaces the call passes through the active
context to both interfaces: to this one and to the other one, whose
send_token_ah is still uninitialized (equal to 0). I tried several
cosmetic changes, but they didn't really help, probably because
send_token_ah is reinitialized incorrectly in the same way during cpg
synchronization.
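For illustration, here is a minimal model of the fan-out described above and of the kind of guard I mean by a cosmetic change. It is only a sketch: the model_* names are hypothetical, the void pointer stands in for the struct ibv_ah pointer, and no real libibverbs call is made. It shows only the NULL check, not the reinitialization problem during cpg synchronization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-interface state; not the real totemiba instance. */
struct model_iface {
    void *send_token_ah;        /* stand-in for struct ibv_ah *, NULL until
                                   the ESTABLISHED event has been handled */
    unsigned int tokens_posted; /* counts simulated posts */
};

/* Plays the role of totemiba_token_send: refuse to post while the address
 * handle is missing instead of handing NULL to the send call. */
static bool model_token_send(struct model_iface *iface)
{
    if (iface->send_token_ah == NULL) {
        return false;           /* not established yet, skip this ring */
    }
    iface->tokens_posted++;     /* the real code would post the send here */
    return true;
}

/* Plays the role of the active context: one completion on any interface
 * triggers a token send on every interface. Returns how many were sent. */
static unsigned int model_active_token_send(struct model_iface *ifaces,
                                            size_t count)
{
    unsigned int sent = 0;

    assert(ifaces != NULL);
    for (size_t i = 0; i < count; i++) {
        if (model_token_send(&ifaces[i])) {
            sent++;
        }
    }
    return sent;
}
```

With two interfaces of which only the first is established, model_active_token_send posts on the first and skips the second. Silently skipping a ring may not be what the redundant ring layer expects, so treat this as a way to picture the problem rather than as a fix.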
Also, I didn't find any call to ibv_destroy_ah anywhere in the code.
Thank you