Re: Has anyone used corosync with both big & little endian systems in a single cluster?

On Fri, Nov 15, 2013 at 2:29 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
On 11/14/2013 02:22 AM, Christine Caulfield wrote:
On 14/11/13 05:01, John Thompson wrote:
Hi,

I am using corosync in a cluster that includes both big- and little-endian
systems and am coming across crashes when there are retransmits in the cluster.

I wondered therefore if others had tried this previously?

As part of this I have identified that totempg_deliver_fn modifies the
mcast msg in place to perform endian conversion, even though the msg might
still be on a sort queue and used for retransmission.
This means that if systems of different endianness are operating and the
msg is retransmitted, it will already have been endian converted in place,
so the receiving node gets a message in which some fields have already been
endian converted.

I will submit a patch for this.

Endian conversion happens on receipt of the message and is based upon a field in the message indicating which endianness the message was originated with.  If a message is changed in a retransmit queue, I would expect its endian field is also modified, resulting in newly transmitted messages being correctly decoded by the receivers.
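
A minimal sketch of the receive-side detection described above, assuming a hypothetical 16-bit marker field that the sender writes in its native byte order. The struct, the marker value, and the function names are illustrative only and are not corosync's actual definitions.

```c
#include <stdint.h>

/* Assumed marker value, for illustration only. */
#define SKETCH_ENDIAN_MARKER 0xff22u

/* Hypothetical header: the sender writes every field in its own byte order. */
struct sketch_header {
    uint16_t endian_marker;
    uint16_t seq;
};

/* Swap the two bytes of a 16-bit value. */
static uint16_t sketch_swab16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Nonzero when the marker does not read back as the local constant,
 * i.e. the sender's byte order differs from ours. */
static int sketch_conversion_required(const struct sketch_header *h)
{
    return h->endian_marker != SKETCH_ENDIAN_MARKER;
}

/* Read a field in local byte order without touching the received buffer. */
static uint16_t sketch_read_seq(const struct sketch_header *h)
{
    return sketch_conversion_required(h) ? sketch_swab16(h->seq) : h->seq;
}
```

Note that in this sketch the decision depends only on the marker, so swapping some other field in a stored buffer without also updating the marker would leave the two out of step.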

When totem was originally written in Corosync, we had ppc, arm, and x86_64 as the major platforms for Corosync.  But corosync hasn't been tried in years on these platforms.  It did work grand at one point ;)  Most of the world has moved to x86_64, so the need to focus on this area of the code base hasn't presented itself lately.



I suspect it hasn't been tried for a very long time! If you have a patch that fixes the bug it will be gratefully received :-)

Chrissie


Thanks for the responses.

I was trying out corosync in a cluster with one big-endian and four little-endian systems.  When there was a degree of packet loss, which led to retransmissions occurring, a crash would occur.  I worked out that this was in totempg_deliver_fn, where the mcast->msg_count field was VERY high.  When I checked the number, it looked to be endian swapped.  So I tried endian swapping into a local variable in this function, and the totempg_deliver_fn crashes no longer occur.

I have looked into it further and believe this is because totemsrp.c:messages_deliver_to_app (which ends up calling totempg_deliver_fn) delivers the msg whilst it remains on the regular_sort_queue, which can be used for retransmission purposes.  This means that if the msg_count gets endian swapped in place and the message then has to be retransmitted, the node that requested the retransmission gets a message whose msg_count has already been endian swapped.
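
To illustrate the failure mode described above, here is a minimal sketch in C. The struct, the marker value, and all function names are hypothetical stand-ins rather than corosync's real definitions; the sketch assumes a 16-bit count and a marker field that is left unchanged when the count is swapped.

```c
#include <stdint.h>
#include <string.h>

/* Assumed marker value, for illustration only. */
#define DEMO_ENDIAN_LOCAL 0xff22u

/* Hypothetical, simplified message as it sits on a retransmit queue. */
struct demo_msg {
    uint16_t endian_marker; /* written by the sender in its byte order */
    uint16_t msg_count;     /* also in the sender's byte order */
};

static uint16_t demo_swab16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

static int demo_needs_swap(const struct demo_msg *m)
{
    return m->endian_marker != DEMO_ENDIAN_LOCAL;
}

/* Problematic variant: converts the queued buffer itself, so the bytes
 * kept for retransmission no longer match the marker. */
static uint16_t demo_deliver_in_place(struct demo_msg *queued)
{
    if (demo_needs_swap(queued)) {
        queued->msg_count = demo_swab16(queued->msg_count);
    }
    return queued->msg_count;
}

/* Safer variant: converts into a local variable and leaves the queued
 * buffer exactly as it was received. */
static uint16_t demo_deliver_local(const struct demo_msg *queued)
{
    uint16_t msg_count = queued->msg_count;

    if (demo_needs_swap(queued)) {
        msg_count = demo_swab16(msg_count);
    }
    return msg_count;
}

/* Model another node receiving a retransmitted copy of the queued bytes. */
static uint16_t demo_decode_retransmit(const struct demo_msg *queued)
{
    struct demo_msg wire;

    memcpy(&wire, queued, sizeof(wire));
    return demo_deliver_local(&wire);
}
```

With a foreign-order message carrying a count of 3, the in-place variant leaves 3 in the queued buffer while the marker still says "foreign", so a node that later receives the retransmitted bytes swaps again and reads 768. The local-copy variant leaves the queued bytes untouched, so the retransmitted copy still decodes to 3.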

I have sent in a patch that resolves this problem.  My only concern with it is what I have changed around the fragmentation case.  I think I have this wrong and am preparing an updated patch to get it right.

Thanks,
John