Re: Recovery Assistance

Brian Mills <brian@xxxxxxxxxxxxxx> · Sun, 29 Jan 2017 07:16:00 +1100

Hi,
No, it hasn't changed since the first time I looked at it. 

root@atlassian:/home/tbadmin# ps ax | grep post
 1364 ?        Ss     0:00 /usr/lib/postfix/master
 5198 pts/3    S      0:00 su postgres
 5221 pts/3    S      0:00 /usr/lib/postgresql/9.3/bin/postgres -D /etc/postgresql/9.3/main
 5222 ?        Ss     0:10 postgres: startup process   recovering 0000000100000005000000A3
11161 pts/4    S+     0:00 grep --color=auto post

The /var/log/postgres logs haven't got anything new in them. 
Also the screen session hasn't got anything new in them, except it adds a line every time I attempt to connect to it using psql. The last attempt below was my connection attempt for this email, a bit over 8 hours after my last connection attempt. 

Screen session: 
2017-01-28 12:38:35 AEDT FATAL:  the database system is starting up
2017-01-28 23:00:01 AEDT FATAL:  the database system is starting up
2017-01-28 23:00:01 AEDT FATAL:  the database system is starting up
2017-01-29 07:14:00 AEDT FATAL:  the database system is starting up

I also still can't connect. 
postgres@atlassian:/home/tbadmin$ psql
psql: FATAL:  the database system is starting up

Brian Mills
CTO

Mob: 0410660003
Melbourne 03 9012 3460 or 03 8376 6327 |  Sydney 02 8064 3600 |  Brisbane 07 3173 1570
Level 1 |  600 Chapel Street | South Yarra|  VIC |  3141 |  Australia 

On 28 January 2017 at 16:08, Adrian Klaver <adrian.klaver@xxxxxxxxxxx> wrote:
On 01/27/2017 05:40 PM, Brian Mills wrote:

First of all, Thank you for your time to assist me learning. I really

appreciate it.

root# ps ax | grep post

 1364 ?        Ss     0:00 /usr/lib/postfix/master

 5198 pts/3    S      0:00 su postgres

 5221 pts/3    S      0:00 /usr/lib/postgresql/9.3/bin/postgres -D

/etc/postgresql/9.3/main

 5222 ?        Ss     0:10 postgres: startup process   recovering

0000000100000005000000A3

 7930 pts/4    S+     0:00 grep --color=auto post

So if you check back does the recovering XXXXXXXX part change?

If so Postgres is walking through the WAL files as it should.

Its a single machine postgres database server, so I'm assuming there is

no cluster log. If there is one, where would I look for it?

The only log in /var/log/postgres is postgresql-9.3-main.log

That would be it. A single Postgres instance has multiple databases in it, by default it starts with template0, template1 and postgres databases. Then add whatever databases you created and you have a cluster of databases.

which shows (tail):

2017-01-27 20:27:01 AEDT LOG:  database system was shut down at

2017-01-27 20:26:29 AEDT

2017-01-27 20:27:01 AEDT LOG:  MultiXact member wraparound protections

are now enabled

2017-01-27 20:27:01 AEDT LOG:  autovacuum launcher started

2017-01-27 20:27:01 AEDT LOG:  database system is ready to accept

connections

2017-01-27 20:27:02 AEDT LOG:  incomplete startup packet

2017-01-27 20:28:54 AEDT ERROR:  unexpected chunk size 104 (expected

1996) in chunk 3 of 22 for toast value 48862 in pg_toast_20028

2017-01-27 20:28:54 AEDT STATEMENT:  COPY public.bodycontent

(bodycontentid, body, contentid, bodytypeid) TO stdout;

2017-01-27 20:30:13 AEDT LOG:  received fast shutdown request

2017-01-27 20:30:13 AEDT LOG:  aborting any active transactions

2017-01-27 20:30:13 AEDT LOG:  autovacuum launcher shutting down

2017-01-27 20:30:13 AEDT LOG:  shutting down

2017-01-27 20:30:13 AEDT LOG:  database system is shut down

That would be your Attempt 1 log.

I ran the screen utility so I could leave the db started using

the pg_ctl command. The later logs in that session have not progressed,

its last entry is still

2017-01-27 23:00:01 AEDT FATAL:  the database system is starting up

Which is still later datetime than the /var/log/postgres... log.

So it is just logging to stdout and not to the log file.

Connection attempt shows the same log

postgres@atlassian:/home/myuser$ psql

psql: FATAL:  the database system is starting up

Nothing in the syslog that seems connected.

*Brian Mills*

CTO

*Mob: *0410660003 <tel:0410660003>

*Melbourne* 03 9012 3460 <tel:03%209012%203460> or 03 8376 6327

<tel:03%208376%206327> *|* * **Sydney* 02 8064 3600

<tel:02%208064%203600> *|*  *Brisbane* 07 3173 1570 <tel:07%203173%201570>

Level 1 *|*  600 Chapel Street *|* South

Yarra*|*  VIC *|*  3141 *|*  Australia

<https://www.facebook.com/TryBooking/> <https://twitter.com/trybooking> <https://www.linkedin.com/company/trybooking-com>

On 28 January 2017 at 12:05, Adrian Klaver <adrian.klaver@xxxxxxxxxxx

<mailto:adrian.klaver@aklaver.com>> wrote:

    On 01/27/2017 01:31 PM, Brian Mills wrote:

        Hi,

        I have a Atlassian Confluence Wiki that depends on postgres, but I

        haven't much experience with postgres other than for this purpose.

        A few days ago, the hard disk filled, so all services stopped

        working.

        When the admin realised this he increased the disk size (its in

        a cloud,

        so that was easy to do) however I think the way it shutdown left

        Postgres in an inconsistent state for some reason.

        Now when I start it up I get this message in the log over and

        over again:

        "FATAL:  the database system is starting up"

        I have a backup, which I have successfully recovered, but it is

        24 hours

        old, the next backup was the cause of the disk filling. So I'm using

        this as exercise in learning a bit more about postgres.

        I did some research and found a number of options. I took a file

        level

        backup with the service not running then tried 2 things. I

        haven't found

        much else on what to do though.

        *Attempt 1 - Reset Log *

        It sounded like this shouldn't be my first option (it wasn't)

        but it did

        sound like what I needed to do.

        I ran this command

        ./pg_resetxlog /var/lib/postgresql/9.3/main -f

        It worked a treat, the database did startup ok.

        However when I tried to dump the DB:

        root@atlassian:/home/myuser# sudo -u postgres pg_dump confluencedb >

        $now-confluencedb.sql

        pg_dump: Dumping the contents of table "bodycontent" failed:

        PQgetResult() failed.

        pg_dump: Error message from server: ERROR:  unexpected chunk

        size 104

        (expected 1996) in chunk 3 of 22 for toast value 48862 in

        pg_toast_20028

        pg_dump: The command was: COPY public.bodycontent

        (bodycontentid, body,

        contentid, bodytypeid) TO stdout;

        The dump failed, so I assume this did leave my database in an

        inconsistent state.

        *Attempt 2 -  startup manually and let it try recovery*

        I restored my file level backup and started again.

        This time I tried to startup manually on the command line to see the

        output (I'd done it as a service startup a number of times to

        nearly the

        same effect too)

        postgres@atlassian:/usr/lib/postgresql/9.3/bin$ ./pg_ctl -D

        /etc/postgresql/9.3/main start

        server starting

        postgres@atlassian:/usr/lib/postgresql/9.3/bin$ 2017-01-27

        20:36:08 AEDT

        LOG:  database system was interrupted while in recovery at

        2017-01-27

        20:13:26 AEDT

        2017-01-27 20:36:08 AEDT HINT:  This probably means that some

        data is

        corrupted and you will have to use the last backup for recovery.

        2017-01-27 20:36:08 AEDT LOG:  database system was not properly shut

        down; automatic recovery in progress

        2017-01-27 20:36:08 AEDT LOG:  redo starts at 5/528B4558

        2017-01-27 20:36:18 AEDT LOG:  redo done at 5/A3FFF9E8

        2017-01-27 20:36:18 AEDT LOG:  last completed transaction was at log

        time 2017-01-24 02:08:00.023064+11

        2017-01-27 23:00:01 AEDT FATAL:  the database system is starting up

        2017-01-27 23:00:01 AEDT FATAL:  the database system is starting up

    What does ps ax | grep post show?

    Is the cluster set up to log to a file, if so what does it show?

    Does the system log show anything relevant?

        I've left it that way for 12 hours, and its still not allowing

        connections.

        I assume its doing some kind of consistency check?

    What does it say when you attempt a connection?

        Does anyone have any suggestions on what I should be doing to

        try and

        restore this database?

        - The amount of change is minimal in the DB after 6pm it should be

        basically no change overnight.

        - The log above seems to suggest it has completed a redo ok, is

        that right?

        - The last completed transaction time 2017-01-24

        02:08:00.023064+11 the

        log mentions would be fine to use, so loosing even a few hours

        before

        that would be more than adequate.

        I'm just not clear what the database is doing now, or how I

        should be

        trying to recover it.

        Any help anyone can suggest would be great! I've given myself this

        weekend to attempt to recover more than the last backup, after

        that I

        need to restore the service for my team to use and suck up the

        lost last

        day of updates.

        Thanks,

        Brian

    --

    Adrian Klaver

    adrian.klaver@xxxxxxxxxxx <mailto:adrian.klaver@aklaver.com>

-- 

Adrian Klaver

adrian.klaver@xxxxxxxxxxx