I have been struggling with this particular scenario now for a few days and very frustrated with the consistent results. maybe someone can help.
We are migrating a service to new docker container from a bare metal box. This is one of about 150 which we have done already successfully. basically setup a replica, promote it, and point the f5 to the new master. However with this particular migration we always get index corruption on the new server once we are up and running.
In a nut shell, here is our flow.
- Use pg_basebackup initiated from new server against the old one to establish a replica of the old server.
- Recovery.conf file has a restore command to point to the archive directory because we cycle through wal files very frequently.
- Start new replica, let it do it's magic and catch up.
At this point all seems normal. However certain queries against the replica return drastically different result sets when compared to the primary. During our investigation, we promote the replica and do a few things including a vacuumdb -za. This is where we see error messages regarding indexes such as ...
ERROR: failed to re-find parent key in index "myindex" for deletion target page 851753
We could just reindex and be done. But, the db is very large and hot, we cannot have inconsistent data while we reindex.
When we run the same vacuum on the source server, all is good so the source data is good.
We can repeat this scenario in both staging and production so I am thinking it is something that is incompatible between the two.
So a little background just for your reference.
- Source database and target database are the same versions of postgres ( 9.5 ) However, the minor version is different. 9.5.5 vs 9.5.7
- fsync = on
- full_page_write = on
Any tips or help would be appreciated.
Don't have the luxury of doing a pg_dump :(
Thanks
JT