On Fri, Nov 15, 2013 at 2:29 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
On 11/14/2013 02:22 AM, Christine Caulfield wrote:
Endian conversion happens on receipt of the message and is based upon a field in the message indicating which endianness the message originated with. If a message is changed in a retransmit queue, I would expect its endian field to be modified as well, so that newly transmitted messages are correctly decoded by the receivers.
On 14/11/13 05:01, John Thompson wrote:
Hi,
I am using corosync in a cluster that includes both big and little
endian systems and am coming across crashes when there are retransmits
in the cluster.
I wondered therefore if others had tried this previously?
As part of this I have identified that totempg_deliver_fn modifies the
mcast msg in place to convert for endian purposes, even though it might
still be on a sort queue and used for retransmission.
This means that if there are different endian systems operating and a
retransmission of the msg is performed, it will have been endian
converted in-place and so what the node receives is a message that has
some endian converted fields.
I will submit a patch for this.
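To illustrate the failure described above, here is a minimal sketch in C. It is not corosync's code: the struct layout, the marker value ENDIAN_LOCAL, and every function name are assumptions made up for the illustration. It shows only the mechanism: a delivery routine converts a queued message in place, and a later retransmission of that same buffer is converted a second time by the node that receives it.

```c
#include <assert.h>
#include <stdint.h>

/* Marker value chosen for this sketch (an assumption, not taken from the thread). */
#define ENDIAN_LOCAL 0xff22u

/* Hypothetical, simplified message; not corosync's real mcast layout. */
struct demo_mcast {
    uint16_t endian_detector; /* written in the originator's byte order */
    uint16_t msg_count;       /* written in the originator's byte order */
};

static uint16_t swab16(uint16_t x)
{
    return (uint16_t)((x << 8) | (x >> 8));
}

/* Build a message as it would look when sent by an opposite-endian node. */
static struct demo_mcast make_foreign(uint16_t count)
{
    struct demo_mcast m;
    m.endian_detector = swab16(ENDIAN_LOCAL);
    m.msg_count = swab16(count);
    return m;
}

/* Delivery that converts in place, so the queued copy is mutated. */
static uint16_t deliver_in_place(struct demo_mcast *m)
{
    if (m->endian_detector != ENDIAN_LOCAL) {
        m->msg_count = swab16(m->msg_count);
    }
    return m->msg_count;
}

/* Walk through the scenario: deliver locally, then retransmit the same buffer. */
static void check_retransmit_corruption(void)
{
    struct demo_mcast queued = make_foreign(3);

    /* First receiver delivers to the application and sees the right count. */
    assert(deliver_in_place(&queued) == 3);

    /* The queued buffer is retransmitted to a node that missed it. The
     * detector still says "foreign", so that node swaps msg_count again. */
    struct demo_mcast on_wire = queued;
    assert(deliver_in_place(&on_wire) == swab16(3)); /* 0x0300, not 3 */
}
```

Calling check_retransmit_corruption() from a test program passes on either kind of host, because the "foreign" message is built by swapping relative to whatever the local byte order is.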
When totem was originally written in Corosync, ppc, arm, and x86_64 were all major platforms for Corosync. But Corosync hasn't been tried on these platforms in years. It did work grand at one point ;) Most of the world has moved to x86_64, so the need to focus on this area of the code base hasn't presented itself lately.
I suspect it hasn't been tried for a very long time! If you have a patch that fixes the bug it will be gratefully received :-)
Chrissie
Thanks for the responses.
I was trying out corosync in a cluster with one big endian and four little endian systems. When a degree of packet loss led to retransmissions, a crash would occur. I worked out that this was in totempg_deliver_fn, where the mcast->msg_count field was VERY high. On checking the number, it looked to be endian swapped. So I tried endian swapping into a local variable in this function, and the totempg_deliver_fn crashes no longer occur.
I have looked into it further and believe this is because totemsrp.c:messages_deliver_to_app (which ends up calling totempg_deliver_fn) delivers the msg while it remains on the regular_sort_queue, which can be used for retransmission purposes. This means that if the msg_count gets endian swapped in place and the message then has to be retransmitted, the node that requested the retransmission gets a message whose msg_count has already been endian swapped.
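The local-variable approach can be sketched roughly as follows. This is an illustration only, not the submitted patch: the struct, the ENDIAN_LOCAL marker value, and the function names are hypothetical stand-ins. The point is that the conversion writes into a local copy and the queued message is only read, so a later retransmission still carries the originator's byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Marker value chosen for this sketch (an assumption). */
#define ENDIAN_LOCAL 0xff22u

/* Hypothetical, simplified message; not corosync's real mcast layout. */
struct demo_mcast {
    uint16_t endian_detector; /* written in the originator's byte order */
    uint16_t msg_count;       /* written in the originator's byte order */
};

static uint16_t swab16(uint16_t x)
{
    return (uint16_t)((x << 8) | (x >> 8));
}

/* Delivery that converts into a local variable; the const pointer makes
 * it explicit that the queued message is never modified. */
static uint16_t deliver_with_local_copy(const struct demo_mcast *m)
{
    uint16_t msg_count = m->msg_count;

    if (m->endian_detector != ENDIAN_LOCAL) {
        msg_count = swab16(msg_count);
    }
    return msg_count;
}

/* Deliver locally, then "retransmit" the untouched buffer to another node. */
static void check_retransmit_is_clean(void)
{
    /* Message as sent by an opposite-endian node, with a count of 3. */
    struct demo_mcast queued = { swab16(ENDIAN_LOCAL), swab16(3) };

    assert(deliver_with_local_copy(&queued) == 3);
    assert(queued.msg_count == swab16(3)); /* queue entry unchanged */

    struct demo_mcast on_wire = queued;
    assert(deliver_with_local_copy(&on_wire) == 3); /* decodes correctly */
}
```

A sketch like this covers only a single scalar field. Fragmented messages carry more state than one count, so it does not address the fragmentation case mentioned below.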
I have sent in a patch that resolves this problem. The only problem I have with it is what I have changed around the fragmentation case. I think I have this wrong and am preparing the patch to get this right.
Thanks,
John
_______________________________________________ discuss mailing list discuss@xxxxxxxxxxxx http://lists.corosync.org/mailman/listinfo/discuss