Re: After upgrade to 9.3, streaming replication fails to start

Jeff Ross <jross@xxxxxxxxxx> · Wed, 06 Nov 2013 12:26:08 -0700



    On 11/6/13, 11:32 AM, Jeff Janes wrote:

    
      On Wed, Nov 6, 2013 at 9:40 AM, Jeff Ross <jross@xxxxxxxxxx>
        wrote:
        

                _postgresql@nirvana:/var/postgresql $ cat start_hot_standby.sh

                #!/bin/sh

                backup_label=wykids_`date +%Y-%m-%d`

                #remove any existing wal files on the standby

                ssh dukkha.internal rm -rf /wal/*

                #stop the standby server if it is running

                ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432

                psql -c "select pg_start_backup('$backup_label');" template1

                rsync \

                        --copy-links \

                        --delete \

                        --exclude=backup_label \

              
              Excluding backup_label is exactly the wrong thing to
                do.  The only reason backup_label is created in the
                first place is so that it can be copied to the replica,
                where it is needed.  Its existence on the master is a
                nuisance.
              

                        --exclude=postgresql.conf \

                        --exclude=recovery.done \

                        -e ssh -avz /var/postgresql/data.93.5432/ \

                        dukkha.internal:/var/postgresql/data.93.5432/

                ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_xlog/*

                ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_xlog/archive_status/*

                ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_log/*

                ssh dukkha.internal rm -f /var/postgresql/data.93.5432/postmaster.pid

                ssh dukkha.internal ln -s /var/postgresql/recovery.conf /var/postgresql/data.93.5432/recovery.conf

                psql -c "select pg_stop_backup();" template1

                ssh dukkha.internal sudo /usr/local/bin/svc -u /service/postgresql.5432

                
                _postgresql@nirvana:/var/postgresql $ sh -x start_hot_standby.sh

                + date +%Y-%m-%d

                + backup_label=wykids_2013-11-06

                + ssh dukkha.internal rm -rf /wal/*

                + ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432

                + rsync -e ssh /wal/ dukkha.internal:/wal/

                skipping directory .

              
              Where is the above rsync coming from?  It doesn't
                seem to be in the shell script you showed.
              

              Anyway, I think you need to copy the wal over after
                you call pg_stop_backup, not before you call
                pg_start_backup.
              

              Cheers,
              

              Jeff
            
          
    Hi Jeff,

    
    Thanks for the reply.  Oops, I pasted one of the many revisions of
    the script, but not the one with the rsync that copies /wal from the
    primary to the standby.

    
    I should have mentioned that WAL archiving is set up and working from
    the primary to the standby.  It saves WAL both locally on the
    primary and remotely on the standby.

    
    I moved the rsync line to copy wal from primary to secondary after
    pg_stop_backup but I'm still getting the same panic on the standby.

    
    Here's the real, honest version of the script I use to start the hot
    standby:

    
    _postgresql@nirvana:/var/postgresql $ cat start_hot_standby.sh

    #!/bin/sh

    backup_label=wykids_`date +%Y-%m-%d`

    #remove any existing wal files on the secondary

    ssh dukkha.internal "rm -rf /wal/*"

    ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432

    psql -c "select pg_start_backup('$backup_label');" template1

    rsync \

            --copy-links \

            --delete \

            --exclude=backup_label \

            --exclude=postgresql.conf \

            --exclude=recovery.done \

            -e ssh -avz /var/postgresql/data.93.5432/ \

            dukkha.internal:/var/postgresql/data.93.5432/

    ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_xlog/*"

    ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_xlog/archive_status/*"

    ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_log/*"

    ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/postmaster.pid"

    ssh dukkha.internal "ln -s /var/postgresql/recovery.conf /var/postgresql/data.93.5432/recovery.conf"

    psql -c "select pg_stop_backup();" template1

    rsync -e ssh -avz /wal/ dukkha.internal:/wal/

    ssh dukkha.internal sudo /usr/local/bin/svc -u /service/postgresql.5432

    
    Here are the logs on the standby after running the above:

    
    2013-11-06 11:56:30.792461500 <%> LOG:  database system was interrupted; last known up at 2013-11-06 11:52:22 MST

    2013-11-06 11:56:30.800685500 <%> LOG:  entering standby mode

    2013-11-06 11:56:30.800891500 <%> LOG:  invalid primary checkpoint record

    2013-11-06 11:56:30.800930500 <%> LOG:  invalid secondary checkpoint record

    2013-11-06 11:56:30.801004500 <%> PANIC:  could not locate a valid checkpoint record
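
    Given Jeff Janes's point quoted above that backup_label has to end
    up on the replica, one sanity check that could be run on the standby
    before bringing it up is sketched below.  This is only an
    illustration: the function name check_backup_label and its messages
    are made up for this example and are not part of the script above.

```shell
#!/bin/sh
# Hypothetical pre-start check (name and messages are illustrative only).
# Returns 0 if the given data directory contains a backup_label file,
# 1 otherwise.
check_backup_label() {
    datadir=$1
    if [ -f "$datadir/backup_label" ]; then
        echo "backup_label present in $datadir"
        return 0
    fi
    echo "backup_label missing in $datadir" >&2
    return 1
}
```

    On the standby it might be used as
    check_backup_label /var/postgresql/data.93.5432 && sudo /usr/local/bin/svc -u /service/postgresql.5432
    so the server is only started when the file was actually copied over.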

    
    Jeff