Re: Has anyone used corosync with both big & little endian systems in a single cluster?

John Thompson <thompa26@xxxxxxxxx> · Mon, 18 Nov 2013 17:39:28 +1300

On Fri, Nov 15, 2013 at 2:29 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:

On 11/14/2013 02:22 AM, Christine Caulfield wrote:

On 14/11/13 05:01, John Thompson wrote:

Hi,

I am using corosync in a cluster that includes both big and little endian systems and am coming across crashes when there are retransmits in the cluster.

I wondered therefore if others had tried this previously?

As part of this I have identified that totempg_deliver_fn modifies the mcast msg in place to convert for endian purposes, even though it might still be on a sort queue and used for retransmission.

This means that if there are different endian systems operating and a retransmission of the msg is performed, it will have been endian converted in-place and so what the node receives is a message that has some endian converted fields.

I will submit a patch for this.

Endian conversion happens on receipt of the message and is based upon a field in the message indicating which endian the message was originated with.  If a message is changed in a retransmit queue, I would expect its endian field is also modified, resulting in newly transmitted messages being correctly decoded by the receivers.

When totem was originally written in Corosync, we had ppc, arm, and x86_64 as all major platforms for Corosync.  But corosync hasn't been tried in years on these platforms.  It did work grand at one point ;)  Most of the world has moved to x86_64 so the need hasn't presented itself to focus on this area of the code base lately.

I suspect it hasn't been tried for a very long time! If you have a patch that fixes the bug it will be gratefully received :-)

Chrissie

Thanks for the responses.

I was trying out corosync in a cluster with one big endian and 4 little endian systems.  When there was a degree of packet loss that led to retransmissions, a crash would occur.  I worked out that this was in totempg_deliver_fn, where the mcast->msg_count field was VERY high.  When I checked the number, it looked to be endian swapped.  So I tried endian swapping into a local variable in this function, and the totempg_deliver_fn crashes no longer occur.

I have looked into it further and believe this is because totemsrp.c:messages_deliver_to_app (which ends up calling totempg_deliver_fn) delivers the msg whilst it remains on the regular_sort_queue, which can be used for retransmission purposes.  This means that if the msg_count gets endian swapped in place and the message then has to be retransmitted, the node that requested the retransmission gets a message where the msg_count has already been endian swapped.
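
To illustrate the sequence, here is a minimal sketch. It is not the actual corosync code: the struct layout, the marker value and every demo_* name are hypothetical and only model the idea of a message that carries an endian marker plus a msg_count field. It contrasts converting msg_count in the queued buffer with converting it into a local variable.

```c
#include <stdint.h>

/* Hypothetical marker value. A receiver that reads it byte-swapped
 * concludes the originator had the opposite byte order. */
#define DEMO_ENDIAN_LOCAL 0xff22u

/* Hypothetical, heavily simplified message header (not corosync's layout). */
struct demo_mcast {
    uint16_t endian_detector;
    uint16_t msg_count;
};

static uint16_t demo_swab16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Build a message as it looks when received from an opposite-endian node. */
static struct demo_mcast demo_foreign_msg(uint16_t count)
{
    struct demo_mcast m;
    m.endian_detector = demo_swab16(DEMO_ENDIAN_LOCAL);
    m.msg_count = demo_swab16(count);
    return m;
}

/* Problematic pattern: the conversion is written back into the buffer,
 * which may still be queued for retransmission. The marker is left
 * unchanged, so a later receiver will swap msg_count a second time. */
static uint16_t demo_deliver_in_place(struct demo_mcast *m)
{
    if (m->endian_detector != DEMO_ENDIAN_LOCAL) {
        m->msg_count = demo_swab16(m->msg_count);
    }
    return m->msg_count;
}

/* Safer pattern: convert into a local variable and leave the queued
 * buffer exactly as it arrived. */
static uint16_t demo_deliver_local_copy(const struct demo_mcast *m)
{
    uint16_t count = m->msg_count;
    if (m->endian_detector != DEMO_ENDIAN_LOCAL) {
        count = demo_swab16(count);
    }
    return count;
}
```

In this model, a queued message with msg_count 3 that is delivered in place and then copied out for retransmission is decoded by the next receiver as 0x0300 (768) instead of 3. With the local-copy variant the retransmitted copy still decodes to 3.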

I have sent in a patch that resolves this problem.  The only problem I have with it is what I have changed around the fragmentation case.  I think I have this wrong and am preparing the patch to get this right.

Thanks,
John
