Thank you for explaining the steps of your algorithm.
Just one question: How to do the actual transfer of data?
I see two solutions:
1. Read the data into a script (via psycopg2; we love Python)
and dump it into a second connection.
2. Connect Postgres to Postgres and transfer the data without a database
adapter like psycopg2.
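A rough sketch of how I imagine solution 1 (hedged: sqlite3 in-memory databases stand in for psycopg2 here so the snippet is self-contained, but the connect/cursor/executemany calls are analogous; the table and column names are made up for illustration):

```python
import sqlite3

# Stand-ins for the two servers; with psycopg2 these would be
# psycopg2.connect(...) calls against the source and target Postgres.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")

src.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, msg TEXT)")
src.executemany("INSERT INTO logs VALUES (?, ?)", [(1, "a"), (2, "b")])
dst.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, msg TEXT)")

# Read the rows through one connection and dump them into the other.
rows = src.execute("SELECT id, msg FROM logs").fetchall()
dst.executemany("INSERT INTO logs VALUES (?, ?)", rows)
dst.commit()

print(dst.execute("SELECT COUNT(*) FROM logs").fetchone()[0])  # prints 2
```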
Regards,
Thomas
On 23.02.2017 at 17:40, Francisco Olarte wrote:
Thomas:
On Thu, Feb 23, 2017 at 4:16 PM, Thomas Güttler
<guettliml@xxxxxxxxxxxxxxxxxx> wrote:
On 22.02.2017 at 16:00, Adrian Klaver wrote:
only written on a successful transfer. To improve the chances of a
successful transfer, use more, smaller transfer batches
rather than larger transfers.
I really need a solid solution.
You said "... improve the chances of successful transfer ...". This makes me
nervous.
I think what Adrian is saying is that you improve the individual
transfer time, and if it fails you retry.
Delays are no problems, but data loss or duplication is.
Remember you can never guarantee 'exactly once' without very complex
solutions; I think you can do "at least once" or "at most once". That
means you either lose or duplicate.
That being said, IF you have some kind of global, unchanging ( at the
central site ) unique key, you could try the following, using a
holding table in each satellite and assuming you have 'on conflict do
nothing'.
1.- Move rows from main to the holding table in the satellite, in a single
transaction. This is to let you work with an unmovable set ( as your
process is the only one touching the holding tables ). If there is
some data already in holding, that is no problem; it is left over from a
previous crashed transfer.
2.- Insert everything from the holding table into main, using on
conflict do nothing.
3.- When everything is committed in main, truncate the satellite holding table.
If satellite crashes in 1 it will roll back, you have not touched main.
If you crash in 2 you will find the holding table partially full in the
next round, and main will be rolled back ( it's important not to commit
until everything is done in 2, i.e., if you have read problems in the
satellite do not go to 3, just crash and roll back everything ). You can
either do a loop with the current set or append more data, your
choice; it does not matter, as you have to reinsert. The on conflict do
nothing in 2 will take care of potential duplicates.
If you crash in 3 you will transfer the lot again, but the do-nothing
in 2 will eliminate it and 3 will eventually purge it.
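The three steps above could be sketched roughly like this (hedged: a sketch under assumptions, not the exact setup; sqlite3 in-memory databases play the satellite and the central site, and sqlite's INSERT OR IGNORE stands in for Postgres's INSERT ... ON CONFLICT DO NOTHING; table and column names are invented for illustration):

```python
import sqlite3

satellite = sqlite3.connect(":memory:")
central = sqlite3.connect(":memory:")

satellite.executescript("""
    CREATE TABLE main    (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE holding (id INTEGER PRIMARY KEY, payload TEXT);
""")
central.execute("CREATE TABLE main (id INTEGER PRIMARY KEY, payload TEXT)")

satellite.executemany("INSERT INTO main VALUES (?, ?)",
                      [(1, "x"), (2, "y"), (3, "z")])

def step1():
    # 1.- Move rows from main to holding in one transaction; leftovers
    #     from a previous crashed transfer may already sit in holding.
    with satellite:
        satellite.execute("INSERT OR IGNORE INTO holding SELECT * FROM main")
        satellite.execute("DELETE FROM main")

def step2():
    # 2.- Insert everything from holding into central main; OR IGNORE
    #     (on conflict do nothing in Postgres) drops the duplicates.
    rows = satellite.execute("SELECT id, payload FROM holding").fetchall()
    with central:
        central.executemany("INSERT OR IGNORE INTO main VALUES (?, ?)", rows)

def step3():
    # 3.- Only after central has committed, empty the holding table.
    with satellite:
        satellite.execute("DELETE FROM holding")

step1(); step2()            # simulate a crash before step 3 ever runs
step1(); step2(); step3()   # next round retransfers; dupes are ignored

print(central.execute("SELECT COUNT(*) FROM main").fetchone()[0])  # prints 3
```

The retry after the simulated crash re-sends the same three rows, and the conflict handling makes the repeat harmless, which is exactly the "at least once, deduplicated at the target" behaviour described above.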
You can optimize on that, but basically you just repeat this until
everything goes fine. I do this ( just with two DBs, not 100 ) and it
works.
It does a lot of duplicate work, but only on problems; it normally runs smoothly.
If you do not have "on conflict do nothing" ( I do not remember the
versions ) you can use an extra step. Instead of inserting into main in
2, do 2.a - copy holding to a staging table in main ( truncating it
beforehand if a copy is present ) and 2.b - insert the new rows from
that copy, either by using an anti-join with main or by deleting ( in
the same transaction ) the dupes before inserting.
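The anti-join variant of 2.a/2.b might look like this (hedged: again a sqlite3 sketch with invented table names; only the central database is shown, with the satellite's holding rows already read into a Python list):

```python
import sqlite3

central = sqlite3.connect(":memory:")
central.executescript("""
    CREATE TABLE main    (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE staging (id INTEGER PRIMARY KEY, payload TEXT);
""")
central.execute("INSERT INTO main VALUES (1, 'x')")  # transferred earlier

# Rows read from the satellite's holding table; id 1 is a duplicate.
holding_rows = [(1, "x"), (2, "y")]

with central:
    # 2.a - copy holding into a staging table, truncating any old copy first
    central.execute("DELETE FROM staging")
    central.executemany("INSERT INTO staging VALUES (?, ?)", holding_rows)
    # 2.b - insert only the new rows, via an anti-join against main
    central.execute("""
        INSERT INTO main
        SELECT s.id, s.payload FROM staging s
        WHERE NOT EXISTS (SELECT 1 FROM main m WHERE m.id = s.id)
    """)

print(central.execute("SELECT COUNT(*) FROM main").fetchone()[0])  # prints 2
```

Both the copy and the anti-join insert run in the same transaction, so a crash anywhere in step 2 leaves main untouched, matching the crash analysis above.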
Francisco Olarte.
--
Thomas Guettler http://www.thomas-guettler.de/
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general