Hi Chrissie,
Thanks, I will send a mail to this mailing list about this patch.
On Aug 7, 2014 9:35 PM, "Christine Caulfield" <ccaulfie@xxxxxxxxxx> wrote:
On 06/08/14 02:09, jason wrote:
Sorry, the previous patch is wrong. Here is the correction.
That looks good to me and, I think, the best solution. It seems to be decidedly non-trivial to determine if retransmits are present when going into hold.
Thanks!
Chrissie
On Aug 5, 2014 10:18 PM, "Christine Caulfield" <ccaulfie@xxxxxxxxxx> wrote:
Hi Jason,
Thanks for testing that - and the extra info. I'll have another
think then. If I can't come up with anything more we might go with
your patch.
Chrissie
On 05/08/14 13:01, jason wrote:
Hi Christine,
I have tested your patch, but it does not solve my problem. By adding
printf calls, I found that the retrans_message_queue is always empty,
whether or not retransmission occurred in my test case. It seems that
the retrans_message_queue is used only in the recovery state?
On Aug 5, 2014 3:50 PM, "Christine Caulfield" <ccaulfie@xxxxxxxxxx> wrote:
On 01/08/14 10:50, Christine Caulfield wrote:
On 01/08/14 10:42, Jan Friesse wrote:
Jason,
Hi All,
I have encountered a problem: when there is no other activity on the
ring, only retransmission, and the token is in hold mode, the
retransmission becomes slow. Moreover, if the retransmission always
fails but token
Yes
rotation works well, then it takes quite a long time
(fail_to_recv_const * token_hold = 2500 * 180ms = 450sec) for the
retransmitting node to meet the "FAILED TO RECEIVE" condition and
re-construct a new ring. This can be reproduced by the following steps:
1) Create a two-node cluster in udpu transport mode.
2) Wait until there is no other activity on the ring.
3) On one or both nodes, delete the other node from the nodelist in
corosync.conf.
4) Run corosync-cfgtool -R. This can cause a message retransmission,
but I am not sure why.
5) Token rotation still works well, but the retransmission cannot be
satisfied because of the node deletion, so only the "FAILED TO RECEIVE"
condition can form a new ring. But we need to wait 450 seconds for it
to happen. During this wait, we saw the following logs:
This is a really weird case.
Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
Jul 30 11:21:06 notice [TOTEM ] Retransmit List: e
...
This problem can be solved by adding token_hold_cancel_send() to both
the retransmission request and response conditions in orf_token_rtr()
to speed up retransmission. I created a patch below; any comments?
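For reference, the 450-second figure above can be reproduced with a short sketch. It follows the formula given in the message and assumes one failed retransmit is counted per token rotation. The 5 ms rotation time used for the "hold cancelled" case is purely a hypothetical number chosen for illustration; it is not from the thread or from corosync.

```python
def seconds_to_failed_to_receive(fail_to_recv_const, rotation_ms):
    """Estimate the wait until the FAILED TO RECEIVE condition is met.

    Assumes one failed retransmit is counted per token rotation,
    as implied by the formula quoted in the thread.
    """
    return fail_to_recv_const * rotation_ms / 1000.0


# Values quoted in the thread.
FAIL_TO_RECV_CONST = 2500
TOKEN_HOLD_MS = 180

# Hypothetical rotation time if hold mode were cancelled (assumption).
ASSUMED_ACTIVE_ROTATION_MS = 5

held = seconds_to_failed_to_receive(FAIL_TO_RECV_CONST, TOKEN_HOLD_MS)
active = seconds_to_failed_to_receive(FAIL_TO_RECV_CONST, ASSUMED_ACTIVE_ROTATION_MS)

print(f"token in hold mode: {held:.0f} s")
print(f"hold cancelled (assumed rotation time): {active:.1f} s")
```

The point is only that the wait scales linearly with the per-rotation delay, so anything that stops the token from being held while retransmits are outstanding shortens it proportionally.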
Ok. The patch looks fine, but during review I had another idea. What
about prohibiting the start of hold mode when there are messages to
retransmit? Such a solution might be cleaner, don't you think?
Anyway, this is a change in a very critical part of the code, so
Chrissie, can you please take a look at the patch and express your
opinion?
I've been looking it over yesterday. It's a problem I have definitely
seen myself on some VM systems so it's certainly not an isolated case.
I think Honza is right that there might be a better way of fixing it so
I'll have a look.
Chrissie
Annoyingly my common reproducer seems not to be working and I can't get
yours to make it happen either. If you can still reproduce it could you
try this patch for me please?
Chrissie
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss