On 02/01/2018 03:34 AM, Matthias Walther wrote:
Hello Grant,
Hi Matthias,
I think I missed an email, as I don't know whom you're quoting here.
I was actually quoting myself in an email I sent about 7 hours prior. Here's a link to it: https://www.spinics.net/lists/lartc/msg23508.html
So after all it's a race condition during startup. It's awesome that you found the cause! Thanks for all the work you put into this.
You're welcome.
Last night, I managed to make all connections work by executing conntrack -D on the hypervisor. Awesome!
Yay!
But this morning, there were some broken tunnels again. This doesn't seem to last very long.
Hum. :-/
You wrote that I should change the startup order. So should I just postpone starting the VMs for a little while? Or do I need to change the order of my iptables rules somehow?
I think it's an issue between when the IPTables rules are entered vs when the GRE tunnels are brought up.
You might not have the ability to control when GRE packets come in from the remote sites. Thus connection tracking may learn about something before IPTables is ready.
I think that you will need to do some more digging into connection tracking and how to interpret the output. At least enough so that you can learn what is necessary to surgically add / remove entries to the connection tracking table. That way you won't need to blow the entire connection tracking table away like "conntrack -D" does.
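As a rough sketch of what "surgical" could look like: the helper below only builds conntrack(8) command lines restricted to GRE entries between two endpoints, so you can list first and delete second. The addresses are placeholders, the function name is made up for illustration, and I'm assuming your conntrack-tools build understands "-p gre" together with the usual -s / -d filters.

```python
import subprocess

# Map a friendly action name to the conntrack(8) command flag.
ACTIONS = {"list": "-L", "delete": "-D"}


def gre_conntrack_args(action, src=None, dst=None):
    """Build a conntrack argument list limited to GRE entries.

    src/dst filter on the original-direction source and destination.
    Assumption: the installed conntrack-tools accepts "-p gre".
    """
    args = ["conntrack", ACTIONS[action], "-p", "gre"]
    if src:
        args += ["-s", src]
    if dst:
        args += ["-d", dst]
    return args


def run(args):
    """Run the command; needs root and conntrack-tools installed."""
    return subprocess.run(args, capture_output=True, text=True, check=False)


# Placeholder endpoints (documentation address ranges, not real sites).
LOCAL = "192.0.2.1"
REMOTE = "198.51.100.7"

# Inspect before deleting, e.g.:
#   print(run(gre_conntrack_args("list", src=REMOTE, dst=LOCAL)).stdout)
#   run(gre_conntrack_args("delete", src=REMOTE, dst=LOCAL))
```

Listing first is deliberate: it lets you confirm which direction the stale entry was learned from before removing only that entry.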
To me this looks like a bug in the conntrack module. It shouldn't be necessary to clean the table manually once in a while.
I don't know if it's a bug in connection tracking or not. It might simply be a race condition, i.e. the outcome depends on which direction connection tracking (CT) first sees GRE packets from, and possibly the associated replies. That ordering could leave CT in an undesired state.
Note: CT state expiration can also likely cause the "seen first" issue again, even after the systems have been up and the tunnels have passed traffic.
Try clearing the connection tracking table, then start a persistent ping through each tunnel and see whether the tunnels stay up and functional. I.e. constantly send traffic through the tunnels so that the connection tracking table entries don't become stale and get purged, which would mean a new "first seen" condition again.
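A minimal sketch of the persistent ping test, assuming Linux iputils ping (where "-i" sets the interval in seconds). The peer addresses are hypothetical inner tunnel addresses, one per tunnel, and the function names are mine. Pick an interval comfortably shorter than whatever the GRE conntrack timeout is on your kernel.

```python
import subprocess


def keepalive_ping_args(peer, interval_s=10):
    """Build an endless ping command line for one tunnel peer."""
    return ["ping", "-i", str(interval_s), peer]


def start_keepalives(peers, interval_s=10):
    """Start one background ping per tunnel and return the processes.

    Output is discarded; the point is only to keep traffic flowing
    so the conntrack entries for the tunnels keep getting refreshed.
    """
    return [
        subprocess.Popen(
            keepalive_ping_args(peer, interval_s),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        for peer in peers
    ]


# Hypothetical addresses on the far side of each GRE tunnel.
TUNNEL_PEERS = ["10.0.0.2", "10.0.1.2"]

# procs = start_keepalives(TUNNEL_PEERS)
# ... later, to stop them: for p in procs: p.terminate()
```

If the tunnels stay healthy for as long as these run and break some time after you stop them, that points at state expiration rather than the startup race.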
If the persistent ping does work, 1) you have a workaround, and 2) you know that it's likely CT state expiration, which means that there may be a tunable that can help prevent the relevant state information from expiring.
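To see whether such a tunable exists on your system, something like the sketch below could help. It simply looks for GRE timeout files under /proc/sys/net/netfilter and prints nothing if there are none. Recent kernel documentation lists nf_conntrack_gre_timeout and nf_conntrack_gre_timeout_stream, but whether your kernel exposes them is an assumption on my part, which is why the code globs instead of hard-coding the names.

```python
from pathlib import Path


def find_gre_timeouts(base="/proc/sys/net/netfilter"):
    """Return {sysctl_name: seconds} for any GRE conntrack timeouts found.

    Returns an empty dict when the directory or the files are missing,
    e.g. on kernels that do not expose these tunables.
    """
    base_path = Path(base)
    if not base_path.is_dir():
        return {}
    timeouts = {}
    for path in sorted(base_path.glob("nf_conntrack_gre_timeout*")):
        timeouts[path.name] = int(path.read_text().strip())
    return timeouts


for name, seconds in find_gre_timeouts().items():
    print(f"{name} = {seconds}")
```

If the files are there, raising the values (via sysctl or by writing to the files as root) would be the thing to experiment with once the ping test has confirmed expiration as the cause.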
-- Grant. . . . unix || die