Also,
Is there a way to remove these things from the init target node easier?
d= p=504 a=ERROR: 55000: previous init failed, manual cleanup is required
d= p=504 a=DETAIL: Found bdr.bdr_nodes entry for bdr (6147869128174526660,1,16908,) with state=i in remote bdr.bdr_nodes
d= p=504 a=HINT: Remove all replication identifiers and slots corresponding to this node from the init target node then drop and recreate this database and try again
On Mon, May 11, 2015 at 11:33 PM, Wayne E. Seguin <wayneeseguin@xxxxxxxxx> wrote:
To the wonderful BDR folks,I have a 5 node pg+bdr cluster that I've spun up. I have verified that each node can connect via psql to each other node. I am having an issue creating the bdr group and having the nodes join each other.The steps I am taking are as follows (I used pg compiled using getting started guide instructions):1. initdb2. start postgres3. create database pgbdr4. create extension btree_gist (actually I'm doing this on template1 for my purposes)5. create extension bdr6. on node 0 I am doing:SELECT bdr.bdr_group_create(local_node_name := 'pgbdr0',node_external_dsn := 'host=10.244.2.2 port=5432 user=postgres dbname=pgbdr');7. on nodes 1-4 I am doing (adjusted for the nodes IP):SELECT bdr.bdr_group_join(local_node_name := 'pgbdr1',node_external_dsn := 'host=10.244.2.6 port=5432 user=postgres dbname=pgbdr',join_using_dsn := 'host=10.244.2.2 port=5432 user=postgres dbname=pgbdr');Sometimes this works and all nodes are joined, but it isn't consistent and I'm not sure at the moment how to debug why.From the most recent attempt on node 0:pgbdr=# SELECT * FROM bdr.bdr_nodes;node_sysid | node_timeline | node_dboid | node_status | node_name | node_local_dsn | node_init_from_dsn---------------------+---------------+------------+-------------+-----------+------------------------------------------------------+--------------------6147857338297172162 | 1 | 16908 | r | rdpg0 | host=10.244.2.2 port=5432 user=postgres dbname=pgbdr |(1 row)From the most recent attempt on node 1:pgbdr=# SELECT * FROM bdr.bdr_nodes;node_sysid | node_timeline | node_dboid | node_status | node_name | node_local_dsn | node_init_from_dsn---------------------+---------------+------------+-------------+-----------+------------------------------------------------------+------------------------------------------------------6147857345042854082 | 1 | 16908 | i | rdpg1 | host=10.244.2.6 port=5432 user=postgres dbname=pgbdr | host=10.244.2.2 port=5432 user=postgres dbname=pgbdr(1 row)I have pasted some relevant logs from node 0 further below.Even if I use 'SELECT bdr.bdr_node_join_wait_for_ready();' it will usually just hang.How can I debug why the nodes are not connecting to each other find out where they getting stuck trying to join?d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG: 00000: received replication command: IDENTIFY_SYSTEM [171/773]d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION: exec_replication_command, walsender.c:1291d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG: 00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857345042854082_1_16908__"LOGICAL bdrd=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION: exec_replication_command, walsender.c:1291d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG: 00000: searching for logical decoding starting point, starting at 0/18E87A0d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION: DecodingContextFindStartpoint, logical.c:469d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG: 00000: logical decoding found consistent point at 0/18E87A0d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DETAIL: There are no running transactions.d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION: SnapBuildFindSnapshot, snapbuild.c:1271d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG: 00000: exported logical decoding snapshot: "0000071E-1" with 0 transaction IDsd=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION: SnapBuildExportSnapshot, snapbuild.c:604d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG: 00000: received replication command: IDENTIFY_SYSTEMd=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION: exec_replication_command, walsender.c:1291d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG: 00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857344392790209_1_16908__"LOGICAL bdrd=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION: exec_replication_command, walsender.c:1291d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG: 00000: searching for logical decoding starting point, starting at 0/18EC970d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION: DecodingContextFindStartpoint, logical.c:469d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 00000: logical decoding found initial starting point at 0/18EC970d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DETAIL: 1 transaction needs to finish.d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION: SnapBuildFindSnapshot, snapbuild.c:1338d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG: 00000: received replication command: IDENTIFY_SYSTEMd=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION: exec_replication_command, walsender.c:1291d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG: 00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857340000055490_1_16908__"LOGICAL bdrd=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION: exec_replication_command, walsender.c:1291d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG: 08006: unexpected EOF on client connection with an open transactiond=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION: SocketBackend, postgres.c:344d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG: 00000: searching for logical decoding starting point, starting at 0/18EC9F0d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION: DecodingContextFindStartpoint, logical.c:469d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 00000: logical decoding found initial starting point at 0/18EC9F0d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DETAIL: 1 transaction needs to finish.d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION: SnapBuildFindSnapshot, snapbuild.c:1338d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 00000: logical decoding found consistent point at 0/18ECA30d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DETAIL: Transaction ID 1822 finished; no more running transactions.d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION: SnapBuildEndTxn, snapbuild.c:907d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 00000: logical decoding found consistent point at 0/18ECA30d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DETAIL: Transaction ID 1822 finished; no more running transactions.d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION: SnapBuildEndTxn, snapbuild.c:907d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 00000: exported logical decoding snapshot: "00000720-1" with 0 transaction IDsd=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION: SnapBuildExportSnapshot, snapbuild.c:604d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 00000: exported logical decoding snapshot: "0000071F-1" with 1 transaction IDd=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION: SnapBuildExportSnapshot, snapbuild.c:604d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG: 00000: received replication command: IDENTIFY_SYSTEMd=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION: exec_replication_command, walsender.c:1291d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG: 00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857340923527359_1_16908__"LOGICAL bdrd=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION: exec_replication_command, walsender.c:1291d= p=392 a=DEBUG: 00000: started 0 electionsd= p=392 a=LOCATION: bdr_sequencer_start_elections, bdr_seq.c:688d= p=392 a=DEBUG: 00000: started 0 votesd= p=392 a=LOCATION: bdr_sequencer_vote, bdr_seq.c:629d= p=392 a=DEBUG: 00000: tallied 0 electionsd= p=392 a=LOCATION: bdr_sequencer_tally, bdr_seq.c:755d= p=392 a=DEBUG: 00000: checked 0 sequences for fillingd= p=392 a=LOCATION: bdr_sequencer_fill_sequences, bdr_seq.c:1034d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG: 08006: unexpected EOF on client connection with an open transactiond=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION: SocketBackend, postgres.c:344d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG: 08006: unexpected EOF on client connection with an open transactiond=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION: SocketBackend, postgres.c:344d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG: 00000: searching for logical decoding starting point, starting at 0/18ECA60d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION: DecodingContextFindStartpoint, logical.c:469d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 00000: logical decoding found initial starting point at 0/18ECA60d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DETAIL: 2 transactions need to finish.d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION: SnapBuildFindSnapshot, snapbuild.c:1338d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 00000: logical decoding found consistent point at 0/18ECAD0d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DETAIL: Transaction ID 1823 finished; no more running transactions.d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION: SnapBuildEndTxn, snapbuild.c:907d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 00000: exported logical decoding snapshot: "00000721-1" with 0 transaction IDsd=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION: SnapBuildExportSnapshot, snapbuild.c:604d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG: 08006: unexpected EOF on client connection with an open transactiond=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION: SocketBackend, postgres.c:344