Re: Apparent issue with fragmentation and config changes

Jan Friesse <jfriesse@xxxxxxxxxx> · Wed, 11 Dec 2013 14:02:33 +0100

JC,
let me quickly explain the original intent. Say there is a membership
change and there are existing (queued) messages in the totemsrp buffer.
We need to make sure that sync messages are sent BEFORE any of those
queued messages. That's why trans_ack was originally introduced.

Sadly, the fragmentation layer created another problem: what if there
are messages for which some fragments were delivered and some were not?
If sync messages then get priority, we lose the content of the fragment
buffer (it's basically cleared).

This is solved by the patch you are talking about, which adds another
fragmentation queue. So now there are 2 queues: one for the transitional
state and a second one for the normal state.
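
To make the two-queue idea concrete, here is a toy Python model. It is only
a sketch and not the totempg code: the class names, the "regular" and
"transitional" labels and the add_fragment helper are all made up for
illustration. The point is just that each sender has a separate reassembly
context per state, so fragments buffered in one state are not wiped by
traffic delivered in the other.

```python
from dataclasses import dataclass, field


@dataclass
class Assembly:
    """Partially rebuilt application message from one sender (hypothetical)."""
    data: bytearray = field(default_factory=bytearray)
    last_frag_num: int = 0


class Reassembler:
    """Keeps one set of per-node contexts for each state (hypothetical)."""

    def __init__(self):
        # Two independent maps: nodeid -> Assembly, one per state.
        self.contexts = {"regular": {}, "transitional": {}}

    def context_for(self, nodeid, state):
        # Create the context on first use; later calls return the same one.
        return self.contexts[state].setdefault(nodeid, Assembly())

    def add_fragment(self, nodeid, state, frag_num, payload):
        # Append a fragment to the context that belongs to this state only.
        asm = self.context_for(nodeid, state)
        asm.data += payload
        asm.last_frag_num = frag_num


r = Reassembler()
r.add_fragment(5, "regular", 1, b"first half ")   # buffered before the change
r.add_fragment(5, "transitional", 1, b"sync-era") # traffic during transition
# The regular-state buffer is untouched by the transitional traffic.
regular = r.context_for(5, "regular")
transitional = r.context_for(5, "transitional")
```

Note that this model also shows the limitation you describe: a fragment
that arrives in one state cannot be stitched onto a buffer that lives in
the other state's context.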

JC Hugly wrote:
> Dear Corosync authors,
> 
> Due to libqb license issues, I work with version 1.4.6, but it seems that the code in question is the same in 2.x.
> 
> I seem to have stumbled on a few issues related to fragmentation in combination with config changes. 
> 
> The main issue is this:
> Sometimes the first totem message delivered during the transitional configuration is the continuation of a message that was delivered before. Similarly, the last message delivered during the transitional configuration can be fragmented into the next message.
> 
> In both these cases, reassembly fails since the reassembly context is changed during the transitional configuration (per the patch signed off by Jan Friesse on 11/8/2012).
> 
> I am not sure which part is a bug: that messages can continue each other across a transitional configuration boundary, or that the reassembly context gets changed, but the two things cannot work together.
> 
> A couple of side issues are that:
> 
> 1 - The fragmentation code resets the next fragment number to 1 whenever it can fit a message in the send buffer; no matter that the buffer may be currently accumulating data for fragment 2 or 3 or what not. That messes up the reassembly code.

Yes, but the queue should also be changed.

> 
> 2 - Whenever the re-assembly code hits a fragment that does not stitch, it starts discarding everything until a first fragment shows up (although I am not sure it always achieves that; see point 1). I believe the intent was to drop only the one or two application message pieces that can't be stitched. I have an alternate, much simpler writing of totempg_deliver_fn that does just that, but we can talk about it later. I suspect that fragments that don't connect are not supposed to happen at all and that I see that only because of the main issue I described above. Am I suspecting right?

Yes, fragments that don't connect simply shouldn't happen.
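
Just to illustrate the "drop only what cannot be stitched" behaviour you
describe, here is a toy Python sketch. It is not totempg_deliver_fn and it
is not your patch. The Packet and Assembly types are hypothetical, and the
meaning given to the continuation and fragmented fields (0 means none,
anything else is a fragment number) is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    continuation: int  # 0, or the fragment number the first piece continues
    fragmented: int    # 0, or the fragment number of the unfinished last piece
    pieces: list       # packed application message pieces (bytes)


@dataclass
class Assembly:
    buf: bytearray = field(default_factory=bytearray)
    last_frag_num: int = 0


def deliver(asm, pkt):
    """Return the complete messages that this packet makes available."""
    delivered = []
    pieces = list(pkt.pieces)
    if pkt.continuation != asm.last_frag_num:
        # The first piece does not stitch onto what we hold:
        # drop the stale partial message, if any ...
        asm.buf.clear()
        asm.last_frag_num = 0
        # ... and drop the orphaned continuation piece, but nothing else.
        if pkt.continuation != 0:
            pieces = pieces[1:]
    for i, piece in enumerate(pieces):
        asm.buf += piece
        if i == len(pieces) - 1 and pkt.fragmented:
            # Unfinished last piece: keep it and wait for the next packet.
            asm.last_frag_num = pkt.fragmented
        else:
            delivered.append(bytes(asm.buf))
            asm.buf.clear()
            asm.last_frag_num = 0
    return delivered


asm = Assembly()
out1 = deliver(asm, Packet(0, 1, [b"AAA", b"BB"]))  # "BB" is left pending
out2 = deliver(asm, Packet(3, 0, [b"zz", b"CCC"]))  # 3 != 1: drop "BB", "zz"
out3 = deliver(asm, Packet(0, 0, [b"D"]))           # back to normal

ok = Assembly()
good = []
for p in (Packet(0, 1, [b"he"]), Packet(1, 2, [b"ll"]),
          Packet(2, 0, [b"o", b"x"])):
    good += deliver(ok, p)                          # fragments that connect
```

In this model a mismatch costs only the pending partial message and the
orphaned piece; every complete piece packed in the same packet is still
delivered.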

> 
> If you have an idea about how to deal with fragmentation across transitional configuration boundaries, I will be more than happy to try out things for you. I have a test program that can produce these problems at will (I don't want to get into how I do that, just yet).
> 

I have been looking at your patch and I'm not sure whether it is correct.
I'm testing it right now.

Regards,
  Honza

> Thanks a lot for reading thus far.
> 
> J-C
>
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss
> 
