Hi,
We have a script that runs a pg_dump off of an RDS PG13.3 replica several times per day. We then load this dump using pg_restore into another postgres RDS db in another AWS account, scrub some of the data, and then take a snapshot of it. We then use these snapshots for testing and staging servers. This script has been running fine for years without any issues.
However, a little less than a month ago, we changed the pg_dump command in the script to use the "-j" flag to speed up our dumps. It was my understanding that starting with postgres v10, this option should now be supported on replicas. Since we recently upgraded to Postgres 13.3 from 9.6, we thought it would be a good idea to start using this feature to speed up our pg_dumps.
In order to use parallel dumps, we made two changes to our script:
1) We swapped to this command:
pg_dump -Fd -j9 -Z0 -x -hreplica -Uuser -f dump_dir
For reference, the previous command (which never had any problems) was:
pg_dump -Fc -Z0 -x -hreplica -Uuser > dump
2) We added code to resize our replica db to a larger hardware size before starting the pg_dump so we can utilize more parallelism. After resizing the db, the script waits for our replica to go back to the "available" state before running the pg_dump command. When the pg_dump finishes, we then resize the db back down to a t3.medium for cost reasons.
When we first rolled this out, the speed improvement was fantastic, it brought our pg_dump down from 30 minutes to 5. However, within a week or so of rolling this out, we've noticed that sometimes the dumps produced by our pg_dump command were not consistent/synchronized. Sometimes when we would restore the backup using pg_restore to our intermediate db it would throw duplicate key errors or other types of constraint errors. Sometimes, even when the pg_restore fully restored without errors, we noticed that some of the sequences were wrong (we saw some sequences with a value less than the max id in the column, so new inserts would start failing on the table). The errors were never consistent either.
It appears to me that the dump produced by our pg_dump is not actually synchronized. I checked the postgres logs on the server and don't see anything noteworthy. We also are not getting any errors from our pg_dump command. Everything we are using is version 13.3 (all of the dbs involved and the server running the script is using v13.3 of pg_dump/pg_restore commands)
I was curious if anyone had any thoughts on what could be causing this. From my reading of the pg_dump manual, it seems like it should always be producing a consistent dump, but please correct me if I'm misunderstanding. I've also opened a ticket with AWS's support to see if they have any ideas.
For now, we've had to go back to using the single threaded pg_dump (which is disappointing because it's 25 minutes slower). We were really hoping to take advantage of the parallelism.
Any ideas would be much appreciated.
Thanks,
Chris