So for about a month now, we've been getting things prepared to use a BDR cluster in a production, multi-region setup on aws. Our initial testing produced some absolutely fantastic results with replication delays less than 150ms between singapore, ireland, and north virginia and this is will SSL encryption.
We have, just recently, ran into a problem. I created a test cluster only within NV and after about a week of working without any problems, we got an error: Unexpected EOF on SSL connection. I had seen something like this before but on initial cluster join and chalked it up to me doing something wrong. This was after a week of working without issue. I wasn't sure what to do next. restarting the database started producing errors like this:
LOG: starting background worker process "bdr (6188205071755053119,1,16385,)->bdr (6188203625564571611,1,"
FATAL: mismatch in worker state, got 3, expected 1
LOG: starting background worker process "bdr (6188205071755053119,1,16385,)->bdr (6188203625564571611,1,"
FATAL: mismatch in worker state, got 3, expected 1
FATAL: mismatch in worker state, got 3, expected 1
LOG: starting background worker process "bdr (6188205071755053119,1,16385,)->bdr (6188203625564571611,1,"
LOG: worker process: bdr (6188205071755053119,1,16385,)->bdr (6188203625564571611,1, (PID 20300) exited with exit code 1
This would repeat. So I removed this node from the cluster using the proper bdr commands and tried re-joining but that just resulted in the return error changing from a 3 to a 0 and the same errors repeating. I have BDR completely automated and orchestrated using chef so I simply fired up a new cluster and started over.
My problem is I don't know what caused this and, more importantly, I'm not sure how to fix it / prevent it and I can't launch this into production without figuring this out.
One other thing: I've seen a lot of conflicting information on how to setup BDR on ubuntu (using ppas, what pkg to install, and where to get source) I'm curious now if I don't have a younger version and that this issue is all but fixed now. Here are my build steps if anyone has any comments on how to setup bdr better, please let me know.
I grab postgres 9.4.4 from here:
and compile it with "./configure --prefix=/opt/psql --with-openssl && make -j4 -s install"
then I compile and install the btree_gist module
then I get the BDR plugin from here:
and compile it with "./configure && make -j4 -s all && make install"
then init the db and set everything with config, ssl certs, and cluster creation and joining.
Any help on this would be really appreciated.
Thanks guys
Charles