Re: 9.2 to 9.5 pg_upgrade losing data

Adrian Klaver <adrian.klaver@xxxxxxxxxxx> · Mon, 15 Aug 2016 07:55:06 -0700

On 08/15/2016 07:40 AM, Pete Fuller wrote:
Directories are correct.  We do not utilize tablespaces.

Anything obviously different in the setup between your production 
servers and the testing and development clusters?

On Aug 15, 2016, at 10:06 AM, Adrian Klaver <adrian.klaver@xxxxxxxxxxx
<mailto:adrian.klaver@xxxxxxxxxxx>> wrote:

On 08/15/2016 06:20 AM, Pete Fuller wrote:
Hello all,

We are attempting to upgrade a production 9.2 postgres cluster to
9.5. The server is running Centos 7 with the Centos version -
currently 9.2.15 We are installing the postgresql provided rpm of
postgresql9.5 from the postgresql repo, currently 9.5.4. In testing
and on our development cluster, the upgrade worked without issue but
something is happening on our production DB that we can not figure out.

The 9.2 cluster is in streaming replication controlled by pacemaker.
Steps we have been using are the following.

• verify the replica server is in a good state with up to date data,
then put the machine in standby, stopping the 9.2 db
• Removing the recovery.conf file and starting the 9.2 server
manually to get some stats to verify later (counts of rows from
several key tables)
• Stop the 9.2 server, (/bin/pg_ctl stop -D /var/lib/pgsql/data ) and
install the 9.5 binaries.
• init the 9.5 db
• run a pg_upgrade with the link option. The full command we are using is
/usr/pgsql-9.5/bin/pg_upgrade -d /var/lib/pgsql/data -D
/var/lib/pgsql/9.5/data -b /bin -B /usr/pgsql-9.5/bin/ --link -r -v

Are you sure you are linking the appropriate directories?

Are there any tablespaces in the mix?

• run the full vacuum analyze script that pg_upgrade generated

What we are finding in testing is we get thru the upgrade process,
start the 9.5 server manually to compare the row counts we collected
immediately before the upgrade and we are seeing different data,
sometimes drastically. This data includes missing user accounts, etc,
that is not perishable data.

For example - 9.2 standalone before running pg_upgrade gives this
information
tracks=# select count(*) from users;
count
-------
13945
(1 row)

tracks=# select count(*) from sessions;
count
-------
  559
(1 row)

PG 9.5 after the upgrade - same machine, upgraded database

tracks=# select count(*) from sessions;
-------
  155
(1 row)

tracks=# select count(*) from users;
count
-------
13157
(1 row)

We have enabled verbose logging and have reviewed it but are seeing
no errors in the migration. The 9.5 server starts without complaint
and if we did not do these manual checks we would not see a problem.
We have attempted this upgrade on this machine repeatedly in prepping
for our maintenance windows, and on occasion our manual checks have
shown good numbers.  At this point, though, it has failed silently
enough times that I have little confidence in the data even if our
rudimentary checks do show identical counts after the upgrade.

Has anyone else seen this issue?

The only 'hack' during the upgrade that isn't straight from the
postgres_upgrade man page is to handle a renamed option in postgresql
that cents back ported - unix_socket_directories. Using the rename of
pg_ctl that is referenced in the top answer here to force a couple
variables into the centos provided pg_ctl binary -
http://dba.stackexchange.com/questions/50135/pg-upgrade-unrecognized-configuration-parameter-unix-socket-directory

Any information or suggestions would be helpful at this point.

--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx <mailto:adrian.klaver@xxxxxxxxxxx>

--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general