Re: psql: FATAL: the database system is starting up

Tom K <tomkcpr@xxxxxxxxx> · Sat, 1 Jun 2019 15:42:42 -0400

On Sat, Jun 1, 2019 at 3:32 PM Tom K <tomkcpr@xxxxxxxxx> wrote:

On Sat, Jun 1, 2019 at 9:55 AM Adrian Klaver <adrian.klaver@xxxxxxxxxxx> wrote:
On 5/31/19 7:53 PM, Tom K wrote:

> 

>     There are two places to connect with the Patroni community: on github,

>     via Issues and PRs, and on channel #patroni in the PostgreSQL Slack. If

>     you're using Patroni, or just interested, please join us.

> 

> 

> Will post there as well.  Thank you.  My thinking was to post here first 

> since I suspect the Patroni community will simply refer me back here 

> given that the PostgreSQL errors are originating directly from PostgreSQL.

> 

> 

>     That being said, can you start the copied Postgres instance without

>     using the Patroni instrumentation?

> 

> 

> Yes, that is something I have been trying to do actually.  But I hit a 

> dead end with the three errors above.

> 

> So what I did was copy a single node's backed-up data files to 

> */data/patroni* on that same node ( this is the PostgreSQL data 

> directory as defined through Patroni ), then ran this ( 

> psql03 = 192.168.0.118 ):

> 

> # sudo su - postgres

> $ /usr/pgsql-10/bin/postgres -D /data/patroni 

> --config-file=/data/patroni/postgresql.conf 

> --listen_addresses=192.168.0.118 --max_worker_processes=8 

> --max_locks_per_transaction=64 --wal_level=replica 

> --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 

> --max_replication_slots=10 --max_connections=100 --hot_standby=on 

> --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5

Why all the options?

That should be covered in postgresql.conf, no?

> 

> This resulted in one of the 3 messages above.  Hence the post here.  If 

> I can start a single instance, I should be fine since I could then 1) 

> replicate over to the other two or 2) simply take a dump, reinitialize 

> all the databases then restore the dump.

> 

What if you move the recovery.conf file out?

Will try.

The below looks like missing/corrupted/incorrect files. Hard to tell 

without knowing what Patroni did?

Storage disappeared from underneath these clusters.  The OS was of course still in memory making futile attempts to write to disk, which would never complete.

My best guess is that Patroni or postgres was in the middle of some writes across the clusters when the failure occurred.

Of note are the characters f2W below.  I see nothing in the postgres source code to indicate this is any recognizable postgres message.  Part of me suspects that the postgres binaries got corrupted; I had that happen with glib-common, and a reinstall fixed it.  However, the checksum of the postgres binaries matches a standalone install exactly, so that should not be the issue.
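
As far as I can tell from the PostgreSQL 10 sources, the text after the colon in "syntax error in history file" is the offending line echoed back from a timeline history file, so f2W may be damaged file content rather than a server message. A minimal sketch for checking that, assuming the *.history files sit in the pg_wal directory under the data directory (they could also come from an archive via restore_command). The function name and the regular expression are mine, not part of PostgreSQL, and the expected line shape (parent timeline ID, switchpoint as hex/hex, free-text reason) is an approximation of what the server parses:

```python
import re
from pathlib import Path

# Assumed shape of a valid entry: <parent timeline ID> <switchpoint hex/hex> <reason>
HISTORY_LINE = re.compile(r"^\s*\d+\s+[0-9A-Fa-f]+/[0-9A-Fa-f]+(\s.*)?$")


def find_bad_history_lines(wal_dir):
    """Return (file name, line number, text) for lines that do not look like history entries."""
    problems = []
    for path in sorted(Path(wal_dir).glob("*.history")):
        # latin-1 maps every byte to a character, so binary garbage cannot raise here
        raw = path.read_bytes().decode("latin-1")
        for number, line in enumerate(raw.split("\n"), start=1):
            stripped = line.strip()
            if not stripped or stripped.startswith("#"):
                continue  # blank lines and comment lines are skipped
            if not HISTORY_LINE.match(line):
                problems.append((path.name, number, line))
    return problems


if __name__ == "__main__":
    # /data/patroni is the data directory named in this thread
    for name, number, text in find_bad_history_lines("/data/patroni/pg_wal"):
        print(f"{name}:{number}: {text!r}")
```

The sketch only reads files. If it reports a line such as 'f2W', that would point at a history file damaged by the storage outage rather than at the binaries.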

> Using the above procedure I get one of three error messages when using 

> the data files of each node:

> 

> [ PSQL01 ]

> postgres: postgres: startup process waiting for 000000010000000000000008

> 

> [ PSQL02 ]

> PANIC:  replication checkpoint has wrong magic 0 instead of 307747550

> 

> [ PSQL03 ]

> FATAL:  syntax error in history file: f2W

> 

> And I can't start any one of them.
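
For the PSQL02 message: 307747550 is 0x1257DADE, which, to my knowledge, is the magic number PostgreSQL expects at the start of pg_logical/replorigin_checkpoint in the data directory. A magic of 0 would then suggest that the file begins with zero bytes, which fits storage vanishing mid-write. A minimal read-only sketch to confirm that, assuming that file location and that the script runs on the same machine architecture that wrote the file; the function name is hypothetical:

```python
import struct
from pathlib import Path

# 307747550 in decimal, the value named in the PANIC message
REPLICATION_STATE_MAGIC = 0x1257DADE


def read_checkpoint_magic(data_dir):
    """Return the first 4 bytes of pg_logical/replorigin_checkpoint as an unsigned int.

    Returns None when the file is missing or shorter than 4 bytes.
    """
    path = Path(data_dir) / "pg_logical" / "replorigin_checkpoint"
    if not path.exists():
        return None
    header = path.read_bytes()[:4]
    if len(header) < 4:
        return None
    # "=I" reads an unsigned 32-bit integer in native byte order
    return struct.unpack("=I", header)[0]


if __name__ == "__main__":
    # /data/patroni is the data directory named in this thread
    magic = read_checkpoint_magic("/data/patroni")
    if magic is None:
        print("replorigin_checkpoint is missing or too short")
    elif magic == REPLICATION_STATE_MAGIC:
        print("magic looks correct")
    else:
        print(f"unexpected magic {magic}, expected {REPLICATION_STATE_MAGIC}")
```

The sketch changes nothing on disk; it only reports whether the header matches the expected value.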

> 

> 

> 

>      >

>      > Thx,

>      > TK

>      >

> 

> 

> 

>     -- 

>     Adrian Klaver

>     adrian.klaver@xxxxxxxxxxx <mailto:adrian.klaver@xxxxxxxxxxx>

> 

-- 

Adrian Klaver

adrian.klaver@xxxxxxxxxxx