Hi,
I have a Atlassian Confluence Wiki that depends on postgres, but I haven't much experience with postgres other than for this purpose.
A few days ago, the hard disk filled, so all services stopped working.
When the admin realised this he increased the disk size (its in a cloud, so that was easy to do) however I think the way it shutdown left Postgres in an inconsistent state for some reason.
Now when I start it up I get this message in the log over and over again:
"FATAL: the database system is starting up"
I have a backup, which I have successfully recovered, but it is 24 hours old, the next backup was the cause of the disk filling. So I'm using this as exercise in learning a bit more about postgres.
I did some research and found a number of options. I took a file level backup with the service not running then tried 2 things. I haven't found much else on what to do though.
Attempt 1 - Reset Log
It sounded like this shouldn't be my first option (it wasn't) but it did sound like what I needed to do.
I ran this command
./pg_resetxlog /var/lib/postgresql/9.3/main -f
It worked a treat, the database did startup ok.
However when I tried to dump the DB:
root@atlassian:/home/myuser# sudo -u postgres pg_dump confluencedb > $now-confluencedb.sql
pg_dump: Dumping the contents of table "bodycontent" failed: PQgetResult() failed.
pg_dump: Error message from server: ERROR: unexpected chunk size 104 (expected 1996) in chunk 3 of 22 for toast value 48862 in pg_toast_20028
pg_dump: The command was: COPY public.bodycontent (bodycontentid, body, contentid, bodytypeid) TO stdout;
The dump failed, so I assume this did leave my database in an inconsistent state.
Attempt 2 - startup manually and let it try recovery
I restored my file level backup and started again.
This time I tried to startup manually on the command line to see the output (I'd done it as a service startup a number of times to nearly the same effect too)
postgres@atlassian:/usr/lib/postgresql/9.3/bin$ ./pg_ctl -D /etc/postgresql/9.3/main start
server starting
postgres@atlassian:/usr/lib/postgresql/9.3/bin$ 2017-01-27 20:36:08 AEDT LOG: database system was interrupted while in recovery at 2017-01-27 20:13:26 AEDT
2017-01-27 20:36:08 AEDT HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
2017-01-27 20:36:08 AEDT LOG: database system was not properly shut down; automatic recovery in progress
2017-01-27 20:36:08 AEDT LOG: redo starts at 5/528B4558
2017-01-27 20:36:18 AEDT LOG: redo done at 5/A3FFF9E8
2017-01-27 20:36:18 AEDT LOG: last completed transaction was at log time 2017-01-24 02:08:00.023064+11
2017-01-27 23:00:01 AEDT FATAL: the database system is starting up
2017-01-27 23:00:01 AEDT FATAL: the database system is starting up
I ran this command
./pg_resetxlog /var/lib/postgresql/9.3/main -f
It worked a treat, the database did startup ok.
However when I tried to dump the DB:
root@atlassian:/home/myuser# sudo -u postgres pg_dump confluencedb > $now-confluencedb.sql
pg_dump: Dumping the contents of table "bodycontent" failed: PQgetResult() failed.
pg_dump: Error message from server: ERROR: unexpected chunk size 104 (expected 1996) in chunk 3 of 22 for toast value 48862 in pg_toast_20028
pg_dump: The command was: COPY public.bodycontent (bodycontentid, body, contentid, bodytypeid) TO stdout;
The dump failed, so I assume this did leave my database in an inconsistent state.
Attempt 2 - startup manually and let it try recovery
I restored my file level backup and started again.
This time I tried to startup manually on the command line to see the output (I'd done it as a service startup a number of times to nearly the same effect too)
postgres@atlassian:/usr/lib/postgresql/9.3/bin$ ./pg_ctl -D /etc/postgresql/9.3/main start
server starting
postgres@atlassian:/usr/lib/postgresql/9.3/bin$ 2017-01-27 20:36:08 AEDT LOG: database system was interrupted while in recovery at 2017-01-27 20:13:26 AEDT
2017-01-27 20:36:08 AEDT HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
2017-01-27 20:36:08 AEDT LOG: database system was not properly shut down; automatic recovery in progress
2017-01-27 20:36:08 AEDT LOG: redo starts at 5/528B4558
2017-01-27 20:36:18 AEDT LOG: redo done at 5/A3FFF9E8
2017-01-27 20:36:18 AEDT LOG: last completed transaction was at log time 2017-01-24 02:08:00.023064+11
2017-01-27 23:00:01 AEDT FATAL: the database system is starting up
2017-01-27 23:00:01 AEDT FATAL: the database system is starting up
I've left it that way for 12 hours, and its still not allowing connections.
I assume its doing some kind of consistency check?
Does anyone have any suggestions on what I should be doing to try and restore this database?
- The amount of change is minimal in the DB after 6pm it should be basically no change overnight.
- The log above seems to suggest it has completed a redo ok, is that right?
- The last completed transaction time 2017-01-24 02:08:00.023064+11 the log mentions would be fine to use, so loosing even a few hours before that would be more than adequate.
I'm just not clear what the database is doing now, or how I should be trying to recover it.
Any help anyone can suggest would be great! I've given myself this weekend to attempt to recover more than the last backup, after that I need to restore the service for my team to use and suck up the lost last day of updates.
Thanks,
Brian