Hi. Some comments on your questions. Hope it helps. 1. I had this problem many times in the past. It's not related to the log size, but to the wal segment size, apparently. More exactly, I experienced that problem each time I had a LARGE transaction(about 10M updates on one table an 60M-100M on another one, all in a single transaction). My solution was to modify my application in order to use smaller transactions(BTW, that large transaction was a bug anyway). Recently, I experienced a network outage between my 2 db nodes for about 6 hours. When connection came back, both nodes started to replicate with no problem(Got lots of conflict resolution messages). So, I insist, this is probably more related to transaction size, not log size. An should not be a problem arising from network outage. 2. When the problem happened, I used bdr_init_copy. I consider it the fastest way. But that will only do if you can put down one of your database servers for a while(exact time depends on database size and network speed). If not, you would probably prefer to use bdr_group_join function, which can be run with postgres running on destination node. Also, while bdr_group_join function joins only a specific database to replication, bdr_init_copy setup replication for all bdr-enabled database on origin node. 3. BDR does replicate DDL. It has some restrictions(http://bdr-project.org/docs/stable/ddl-replication-statements.html), but should not be a problem. However, I would not recommend to use DDL on your application. First, because DDL replication requires full database lock, which will cause all transactions to wait until the DDL replication is completed on all nodes. Second, because, under some circumstances(still not have it really clear), DDL replication will hang up, and you will have to restart some of the nodes in order to be able to replicate DDL again. I recommend DDL replication only for sysadmin usage, and with caution. Also, because BDR is asynchronous, if you are about to run a sequence like this one: UPDATE tab SET col = 0; ALTER TABLE tab ALTER COLUMN col SET NOT NULL; You must wait a while before running the ALTER, because it could happen(has happened to me) that the ALTER starts when the UPDATE has not yet been replicated; that would cause a lock where postgres on any other node will try to replicate that DDL and will fail forever because the column still contains NULL values. The only solution I found was to run postgres in single user mode, and that is really dangerous if you are using BDR. 4. Your first case may not happen, because BDR will replicate transaction in order. For the second one, I've never ran into any similar case. In brief, while BDR is powerful and really useful, it's a little bit dangerous. "Use with caution". Also, you will probably have to tune your application so you won't run into BDR-specific problems. BTW, there is no exactly a "licensed version" and a "free/comunity version". 1.x is available freely, while 2.x was made available only to 2ndQuadrant customers. 3.x is currently on the same state, but it is supposed to be made publicly available, if I'm not wrong. Regards, Alvaro Aguayo Operations Manager Open Comb Systems E.I.R.L. (+51-1) 337-7813 Ext. 4002 www.ocs.pe ----- Original Message ----- From: "soumitra bhandary" <soumitra.bhandary@xxxxxxxxxxx> To: "Rijo Roy" <royvk6644@xxxxxxxxx>, pgsql-admin@xxxxxxxxxxxxxxxxxxxx, "Prince Pathria" <prince.pathria@xxxxxxxxxxx>, bdr-list@xxxxxxxxxxxxxxx Sent: Thursday, 11 April, 2019 21:50:50 Subject: Re: BDR OR MASTER SLAVE REPLICATION WITH REPMGR IS SUITABLE FOR LARGE ONLINE APPLICATION Thanks for the response . Here is my observation / queries about BDR , correct me if I am missing anything . Any other issue with BDR if you guys faced let me know . 1. Due to any reason if replication is failing and replication log byte size reaches more than 1GB . Then corresponding replication slot become unresponsive and need to set up BDR again. 2. Due to any issue (may be network failure or any key conflicts) if replication is stopped or failing then apart from re initializing the cluster (using bdr_init_copy<http://bdr-project.org/docs/stable/command-bdr-init-copy.html> ) any other approach is there to revive the replication ? As bdr.bdr_group_join<http://bdr-project.org/docs/stable/functions-node-mgmt.html#FUNCTION-BDR-GROUP-JOIN> does not work well for large databases for more than 100 GB . 3. For a large OLTP application does BDR replicates the changes made to procedure, function ,triggers , views definition, across the cluster if I set the replication at DB level ? (I am not sure as not tested) 4. Though BDR tries to resolve conflicts using a simple last-update-wins strategy, but still conflicts as mentioned below are very common in a production instance . Are these taken care in the licensed version ? ERROR: insert or update on table "AAAAA" violates foreign key constraint "fk_traf444kk6qrkms7n56aiwq5y" Violating foreign key constraint ,whereas other node is having record. ERROR: multiple unique constraints violated by remotely Inserted tuple An INSERT/INSERT conflict can violate more than one UNIQUE constraint (of which one might be the PRIMARY KEY). 1. Back up , recovery and PITR process . As per the documentation , Logical replication cannot be used for point-in-time recovery . So what is the extent of this statement ? Does it imply even we run the DBs in archival mode and use pg_basebackup to take file level back up in a running instance? Thanks , Soumitra ________________________________ From: Rijo Roy <royvk6644@xxxxxxxxx> Sent: Friday, April 12, 2019 7:42 AM To: soumitra bhandary Cc: Prince Pathria; pgsql-admin@xxxxxxxxxxxxxxxxxxxx Subject: Re: BDR OR MASTER SLAVE REPLICATION WITH REPMGR IS SUITABLE FOR LARGE ONLINE APPLICATION BDR is best when you have a need of multi-master environment like a distributed database paradigm where you have a need of writing/reading to/from local masters let's say masters in NY, London, China, India etc.. And it uses pglogical to achieve this with a globally managed transaction XID's. It needs to be properly tested how it handles conflicts in a heavy oltp workload application. If you only need high availability for a heavy oltp application, a well set active passive model of streaking replication backed by repmgr where writes are only hitting primary and reads on standby will help. Otherwise you can also consider citusdata.. Note: I am not an advocate of any vendors, my comments are purely based on my experience and I love Postgresql the way it is! Regards, Rijo Roy On Fri, 12 Apr 2019, 6:34 am soumitra bhandary, <soumitra.bhandary@xxxxxxxxxxx<mailto:soumitra.bhandary@xxxxxxxxxxx>> wrote: To ensure zero down time. Sent from my iPhone On 12-Apr-2019, at 12:35 AM, Prince Pathria <prince.pathria@xxxxxxxxxxx<mailto:prince.pathria@xxxxxxxxxxx>> wrote: It very much depends upon your use case. Why you want a multi master implementation? Happy to help :) Prince Pathria Systems Engineer Evive +91 9478670472 goevive.com<http://goevive.com> On Thu, Apr 11, 2019 at 8:03 PM soumitra bhandary <soumitra.bhandary@xxxxxxxxxxx<mailto:soumitra.bhandary@xxxxxxxxxxx>> wrote: Hi , Can anyone suggest me for large OLTP application which sort of replication is suitable BDR or master slave replication with synchronous and asynchronous node and REPMGR enabled. Can BDR handle real-time conflicts arises due to high network latency or Dataintegrity conflicts?? Please suggest Sent from my iPhone