Re: Recovery Assistance

Brian Mills <brian@xxxxxxxxxxxxxx> · Sat, 28 Jan 2017 12:40:20 +1100

First of all, Thank you for your time to assist me learning. I really appreciate it. 
root# ps ax | grep post
 1364 ?        Ss     0:00 /usr/lib/postfix/master
 5198 pts/3    S      0:00 su postgres
 5221 pts/3    S      0:00 /usr/lib/postgresql/9.3/bin/postgres -D /etc/postgresql/9.3/main
 5222 ?        Ss     0:10 postgres: startup process   recovering 0000000100000005000000A3 
 7930 pts/4    S+     0:00 grep --color=auto post

Its a single machine postgres database server, so I'm assuming there is no cluster log. If there is one, where would I look for it? 
The only log in /var/log/postgres is postgresql-9.3-main.log

which shows (tail):

2017-01-27 20:27:01 AEDT LOG:  database system was shut down at 2017-01-27 20:26:29 AEDT
2017-01-27 20:27:01 AEDT LOG:  MultiXact member wraparound protections are now enabled
2017-01-27 20:27:01 AEDT LOG:  autovacuum launcher started
2017-01-27 20:27:01 AEDT LOG:  database system is ready to accept connections
2017-01-27 20:27:02 AEDT LOG:  incomplete startup packet
2017-01-27 20:28:54 AEDT ERROR:  unexpected chunk size 104 (expected 1996) in chunk 3 of 22 for toast value 48862 in pg_toast_20028
2017-01-27 20:28:54 AEDT STATEMENT:  COPY public.bodycontent (bodycontentid, body, contentid, bodytypeid) TO stdout;
2017-01-27 20:30:13 AEDT LOG:  received fast shutdown request
2017-01-27 20:30:13 AEDT LOG:  aborting any active transactions
2017-01-27 20:30:13 AEDT LOG:  autovacuum launcher shutting down
2017-01-27 20:30:13 AEDT LOG:  shutting down
2017-01-27 20:30:13 AEDT LOG:  database system is shut down

I ran the screen utility so I could leave the db started using the pg_ctl command. The later logs in that session have not progressed, its last entry is still
2017-01-27 23:00:01 AEDT FATAL:  the database system is starting up 
Which is still later datetime than the /var/log/postgres... log. 

Connection attempt shows the same log
postgres@atlassian:/home/myuser$ psql
psql: FATAL:  the database system is starting up

Nothing in the syslog that seems connected. 

Brian Mills
CTO

Mob: 0410660003
Melbourne 03 9012 3460 or 03 8376 6327 |  Sydney 02 8064 3600 |  Brisbane 07 3173 1570
Level 1 |  600 Chapel Street | South Yarra|  VIC |  3141 |  Australia 

On 28 January 2017 at 12:05, Adrian Klaver <adrian.klaver@xxxxxxxxxxx> wrote:
On 01/27/2017 01:31 PM, Brian Mills wrote:

Hi,

I have a Atlassian Confluence Wiki that depends on postgres, but I

haven't much experience with postgres other than for this purpose.

A few days ago, the hard disk filled, so all services stopped working.

When the admin realised this he increased the disk size (its in a cloud,

so that was easy to do) however I think the way it shutdown left

Postgres in an inconsistent state for some reason.

Now when I start it up I get this message in the log over and over again:

"FATAL:  the database system is starting up"

I have a backup, which I have successfully recovered, but it is 24 hours

old, the next backup was the cause of the disk filling. So I'm using

this as exercise in learning a bit more about postgres.

I did some research and found a number of options. I took a file level

backup with the service not running then tried 2 things. I haven't found

much else on what to do though.

*Attempt 1 - Reset Log *

It sounded like this shouldn't be my first option (it wasn't) but it did

sound like what I needed to do.

I ran this command

./pg_resetxlog /var/lib/postgresql/9.3/main -f

It worked a treat, the database did startup ok.

However when I tried to dump the DB:

root@atlassian:/home/myuser# sudo -u postgres pg_dump confluencedb >

$now-confluencedb.sql

pg_dump: Dumping the contents of table "bodycontent" failed:

PQgetResult() failed.

pg_dump: Error message from server: ERROR:  unexpected chunk size 104

(expected 1996) in chunk 3 of 22 for toast value 48862 in pg_toast_20028

pg_dump: The command was: COPY public.bodycontent (bodycontentid, body,

contentid, bodytypeid) TO stdout;

The dump failed, so I assume this did leave my database in an

inconsistent state.

*Attempt 2 -  startup manually and let it try recovery*

I restored my file level backup and started again.

This time I tried to startup manually on the command line to see the

output (I'd done it as a service startup a number of times to nearly the

same effect too)

postgres@atlassian:/usr/lib/postgresql/9.3/bin$ ./pg_ctl -D

/etc/postgresql/9.3/main start

server starting

postgres@atlassian:/usr/lib/postgresql/9.3/bin$ 2017-01-27 20:36:08 AEDT

LOG:  database system was interrupted while in recovery at 2017-01-27

20:13:26 AEDT

2017-01-27 20:36:08 AEDT HINT:  This probably means that some data is

corrupted and you will have to use the last backup for recovery.

2017-01-27 20:36:08 AEDT LOG:  database system was not properly shut

down; automatic recovery in progress

2017-01-27 20:36:08 AEDT LOG:  redo starts at 5/528B4558

2017-01-27 20:36:18 AEDT LOG:  redo done at 5/A3FFF9E8

2017-01-27 20:36:18 AEDT LOG:  last completed transaction was at log

time 2017-01-24 02:08:00.023064+11

2017-01-27 23:00:01 AEDT FATAL:  the database system is starting up

2017-01-27 23:00:01 AEDT FATAL:  the database system is starting up

What does ps ax | grep post show?

Is the cluster set up to log to a file, if so what does it show?

Does the system log show anything relevant?

I've left it that way for 12 hours, and its still not allowing connections.

I assume its doing some kind of consistency check?

What does it say when you attempt a connection?

Does anyone have any suggestions on what I should be doing to try and

restore this database?

- The amount of change is minimal in the DB after 6pm it should be

basically no change overnight.

- The log above seems to suggest it has completed a redo ok, is that right?

- The last completed transaction time 2017-01-24 02:08:00.023064+11 the

log mentions would be fine to use, so loosing even a few hours before

that would be more than adequate.

I'm just not clear what the database is doing now, or how I should be

trying to recover it.

Any help anyone can suggest would be great! I've given myself this

weekend to attempt to recover more than the last backup, after that I

need to restore the service for my team to use and suck up the lost last

day of updates.

Thanks,

Brian

-- 

Adrian Klaver

adrian.klaver@xxxxxxxxxxx