investigation of {transport: iba + rrp_mode: active} mode issues

Hello there,

Recently I investigated some issues described by Vladimir Voznesensky here
http://lists.corosync.org/pipermail/discuss/2012-September/002008.html

the first one:

# corosync -f
notice [MAIN ] Corosync Cluster Engine ('2.0.1'): started and ready to provide service.
info    [MAIN  ] Corosync built-in features: testagents rdma monitoring
Sep 20 11:28:50 notice  [TOTEM ] Initializing transport (Infiniband/IP).
Sep 20 11:28:50 notice  [TOTEM ] Initializing transport (Infiniband/IP).
corosync: totemsrp.c:3236: memb_ring_id_create_or_load: Assertion `!totemip_zero_check(&memb_ring_id->rep)' failed.

This happens with probability 1/2, when main_iface_change_fn (totemsrp.c) is first called for the second interface, because memb_ring_id_create_or_load uses the first interface's IP address, which is not initialized yet, to form the filename. So when RDMA_CM_EVENT_MULTICAST_JOIN in mcast_rdma_event_fn (totemiba.c) first occurs for the second ring, we hit this assert.
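
To make the ordering problem concrete, here is a minimal, self-contained model of it. It is only a sketch and not corosync code: struct ip_addr, struct srp_model, ip_is_zero and iface_change are hypothetical names that stand in for the totem address type, the totemsrp instance, totemip_zero_check and main_iface_change_fn. It assumes one possible fix, deferring creation of the ring id until interface 0 has reported a non-zero address. I have not checked whether that is the right fix for the real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define IFACE_COUNT 2

/* Hypothetical stand-in for the totem IP address type. */
struct ip_addr {
    uint8_t bytes[4];
};

/* Returns true when the address is all zeroes, i.e. not assigned yet. */
static bool ip_is_zero(const struct ip_addr *addr)
{
    static const struct ip_addr zero;
    return memcmp(addr, &zero, sizeof(zero)) == 0;
}

/* Hypothetical model of the state touched by the interface-change callback. */
struct srp_model {
    struct ip_addr my_addr[IFACE_COUNT];
    bool ring_id_created;
    struct ip_addr ring_rep; /* address the ring id would be derived from */
};

/*
 * Models the interface-change callback. The ring id is derived from
 * interface 0, so it is created only once that address is known,
 * whichever interface happens to report first.
 */
static void iface_change(struct srp_model *m, unsigned iface, struct ip_addr addr)
{
    assert(iface < IFACE_COUNT);
    m->my_addr[iface] = addr;

    if (!m->ring_id_created && !ip_is_zero(&m->my_addr[0])) {
        m->ring_rep = m->my_addr[0];
        m->ring_id_created = true;
    }
}
```

In this model, calling iface_change for interface 1 first leaves ring_id_created false, and the ring id is created only by the later call for interface 0, so the zero-address case that trips the assert cannot be reached.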

the second one:

...
Oct 15 12:46:28 debug [TOTEM ] totemsrp.c:1979 entering GATHER state from 15.
Oct 15 12:46:28 debug [TOTEM ] totemsrp.c:3021 Creating commit token because I am the rep.
Oct 15 12:46:28 debug [TOTEM ] totemsrp.c:1471 Saving state aru 0 high seq received 0
Oct 15 12:46:28 debug [TOTEM ] totemsrp.c:3276 Storing new sequence id for ring 33e0
Oct 15 12:46:28 debug [TOTEM ] totemsrp.c:2035 entering COMMIT state.
and then a segmentation fault.

This one is more serious, since it happens every time, and even if we get past it once it will catch us again during synchronization with other processes.

The segmentation fault occurs in the InfiniBand library when we try to post a message with an uninitialized send_token_ah.
Let's look at the send_token_rdma_event_fn (totemiba.c) code:

...
case RDMA_CM_EVENT_ESTABLISHED:
	instance->send_token_qpn = event->param.ud.qp_num;
	instance->send_token_qkey = event->param.ud.qkey;
	instance->send_token_ah = ibv_create_ah (instance->send_token_pd, &event->param.ud.ah_attr);
	instance->totemiba_target_set_completed (instance->rrp_context);
	break;
...

Here we initialize send_token_ah and then call totemiba_target_set_completed, which calls totemiba_token_send, which in turn calls ibv_post_send with send_token_ah as a parameter. This looks fine, but with two interfaces the call passes through the active rrp context to both interfaces: to this one and to the other one, whose send_token_ah is still uninitialized (equal to 0). I tried several cosmetic changes, but they did not really help, probably because send_token_ah is incorrectly reinitialized in the same way during cpg synchronization.
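
The failure mode can be modelled without libibverbs. The sketch below is not a patch and does not use the real API: struct ah_model, struct iba_model, token_send, token_send_all and on_established are hypothetical stand-ins for struct ibv_ah, the totemiba instance, totemiba_token_send, the active-mode fan-out and the RDMA_CM_EVENT_ESTABLISHED branch. It assumes that a token send requested before the address handle exists may simply be deferred until it does. Whether deferring is acceptable for the totem protocol is an open question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ibv_ah; the sketch never dereferences it. */
struct ah_model {
    int placeholder;
};

/* Hypothetical per-interface state, loosely modelled on the totemiba instance. */
struct iba_model {
    struct ah_model *send_token_ah; /* NULL until ESTABLISHED has been handled */
    bool token_pending;             /* a token send was requested too early */
    int tokens_posted;              /* counts simulated ibv_post_send calls */
};

/* Posts the token only when the address handle exists; otherwise defers it. */
static bool token_send(struct iba_model *inst)
{
    if (inst->send_token_ah == NULL) {
        inst->token_pending = true;
        return false;
    }
    inst->tokens_posted++; /* the real code would call ibv_post_send here */
    inst->token_pending = false;
    return true;
}

/* Models the active-mode fan-out: the token goes to every interface. */
static size_t token_send_all(struct iba_model *ifaces, size_t count)
{
    size_t posted = 0;
    for (size_t i = 0; i < count; i++) {
        if (token_send(&ifaces[i])) {
            posted++;
        }
    }
    return posted;
}

/* Models RDMA_CM_EVENT_ESTABLISHED for one interface. */
static void on_established(struct iba_model *inst, struct ah_model *ah)
{
    assert(ah != NULL);
    inst->send_token_ah = ah;
    if (inst->token_pending) {
        token_send(inst); /* flush the send that was deferred earlier */
    }
}
```

With two interfaces where only the first has seen ESTABLISHED, token_send_all posts on the first and marks the second as pending instead of posting with a null handle. The pending send goes out when on_established later runs for the second interface.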

Also, I did not find ibv_destroy_ah called anywhere in the code.

Thank you
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss

