Re: After upgrade to 9.3, streaming replication fails to start--SOLVED

Jeff Ross <jross@xxxxxxxxxx> · Thu, 07 Nov 2013 09:29:23 -0700



    On 11/6/13, 12:26 PM, Jeff Ross wrote:

    
      On 11/6/13, 11:32 AM, Jeff Janes
        wrote:

      
        On Wed, Nov 6, 2013 at 9:40 AM, Jeff Ross <jross@xxxxxxxxxx>
          wrote:
          

                  _postgresql@nirvana:/var/postgresql $ cat
                  start_hot_standby.sh

                  #!/bin/sh

                  backup_label=wykids_`date +%Y-%m-%d`

                  #remove any existing wal files on the standby

                  ssh dukkha.internal rm -rf /wal/*

                  #stop the standby server if it is running

                  ssh dukkha.internal sudo /usr/local/bin/svc -d
                  /service/postgresql.5432

                  psql -c "select pg_start_backup('$backup_label');"
                  template1

                  rsync \

                          --copy-links \

                          --delete \

                          --exclude=backup_label \

                
                Excluding backup_label is exactly the wrong thing
                  to do.  The only reason backup_label is created in the
                  first place is so that it can be copied to the
                  replica, where it is needed.  It's existence on the
                  master is a nuisance.
                

                      --exclude=postgresql.conf \

                          --exclude=recovery.done \

                          -e ssh -avz /var/postgresql/data.93.5432/ \

                          dukkha.internal:/var/postgresql/data.93.5432/

                  ssh dukkha.internal rm -f
                  /var/postgresql/data.93.5432/pg_xlog/*

                  ssh dukkha.internal rm -f
                  /var/postgresql/data.93.5432/pg_xlog/archive_status/*

                  ssh dukkha.internal rm -f
                  /var/postgresql/data.93.5432/pg_log/*

                  ssh dukkha.internal rm -f
                  /var/postgresql/data.93.5432/postmaster.pid

                  ssh dukkha.internal ln -s
                  /var/postgresql/recovery.conf
                  /var/postgresql/data.93.5432/recovery.conf

                  psql -c "select pg_stop_backup();" template1

                  ssh dukkha.internal sudo /usr/local/bin/svc -u
                  /service/postgresql.5432

                  
                  _postgresql@nirvana:/var/postgresql $ sh -x
                  start_hot_standby.sh

                  + date +%Y-%m-%d

                  + backup_label=wykids_2013-11-06

                  + ssh dukkha.internal rm -rf /wal/*

                  + ssh dukkha.internal sudo /usr/local/bin/svc -d
                  /service/postgresql.5432

                  + rsync -e ssh /wal/ dukkha.internal:/wal/

                  skipping directory .

                
                Where is the above rsync coming from?  It doesn't
                  seem to be in the shell script you showed.
                

                Anyway, I think you need to copy the wal over after
                  you call pg_stop_backup, not before you call
                  pg_start_backup.
                

                Cheers,
                

                Jeff
              
            
      Hi Jeff,

      
      Thanks for the reply.  Oops, I copied one of the many changes to
      the script, but not the one with the rsync to copy /wal from the
      primary to the standby.

      
      I should have mentioned that wal archiving is setup and working
      from the primary to the standby.  It saves wal both on the locally
      on the primary and remotesly on the standby.

      
      I moved the rsync line to copy wal from primary to secondary after
      pg_stop_backup but I'm still getting the same panic on the
      standby.

      
      Here's the real, honest version of the script I use to start the
      hot standby:

      
      _postgresql@nirvana:/var/postgresql $ cat start_hot_standby.sh

      #!/bin/sh

      backup_label=wykids_`date +%Y-%m-%d`

      #remove any existing wal files on the secondary

      ssh dukkha.internal "rm -rf /wal/*"

      ssh dukkha.internal sudo /usr/local/bin/svc -d
        /service/postgresql.5432

      psql -c "select pg_start_backup('$backup_label');"
        template1

      rsync \

              --copy-links \

              --delete \

              --exclude=backup_label \

              --exclude=postgresql.conf \

              --exclude=recovery.done \

              -e ssh -avz /var/postgresql/data.93.5432/ \

              dukkha.internal:/var/postgresql/data.93.5432/

      ssh dukkha.internal "rm -f
        /var/postgresql/data.93.5432/pg_xlog/*"

      ssh dukkha.internal "rm -f
        /var/postgresql/data.93.5432/pg_xlog/archive_status/*"

      ssh dukkha.internal "rm -f
        /var/postgresql/data.93.5432/pg_log/*"

      ssh dukkha.internal "rm -f
        /var/postgresql/data.93.5432/postmaster.pid"

      ssh dukkha.internal "ln -s /var/postgresql/recovery.conf
        /var/postgresql/data.93.5432/recovery.conf"

      psql -c "select pg_stop_backup();" template1

      rsync -e ssh -avz /wal/ dukkha.internal:/wal/

      ssh dukkha.internal sudo /usr/local/bin/svc -u
        /service/postgresql.5432

      
      Here are the logs on the standby after running the above:

      
      2013-11-06 11:56:30.792461500 <%> LOG:  database system
        was interrupted; last known up at 2013-11-06 11:52:22 MST

      2013-11-06 11:56:30.800685500 <%> LOG:  entering
        standby mode

      2013-11-06 11:56:30.800891500 <%> LOG:  invalid
        primary checkpoint record

      2013-11-06 11:56:30.800930500 <%> LOG:  invalid
        secondary checkpoint record

      2013-11-06 11:56:30.801004500 <%> PANIC:  could not
        locate a valid checkpoint record

      
      Jeff

    
    My apologies to Jeff--I'd missed his in-line comment above that I
    should *not* exclude the backup label from the rsync of the primary
    to the standby.  As soon as I removed that exclusion and with his
    other suggested change that I should copy the /wal from the primary
    to the standby after pg_stop_backup, streaming replication started
    on the standby exactly as it should.

    
    Logs from the standby:

    
    2013-11-07 09:21:15.273712500 <%> LOG:  database system
      was interrupted; last known up at 2013-11-07 09:16:05 MST

    2013-11-07 09:21:15.286834500 <%> LOG:  entering
      standby mode

    2013-11-07 09:21:16.873654500 <%> LOG:  restored log
      file "000000010000000000000050" from archive

    2013-11-07 09:21:16.936355500 <%> LOG:  redo starts
      at 0/50000024

    2013-11-07 09:21:17.129718500 <%> LOG:  consistent
      recovery state reached at 0/50036D6C

    2013-11-07 09:21:17.131933500 <%> LOG:  database
      system is ready to accept read only connections

    2013-11-07 09:21:17.136856500 cp:
      /wal/000000010000000000000051: No such file or directory

    2013-11-07 09:21:17.194811500 <%> LOG:  started
      streaming WAL from primary at 0/51000000 on timeline 1

    
    Jeff Ross