Standby server (9.6.6) with corrupt file

Edson Carlos Ericksson Richter <richter@xxxxxxxxxxxxxx> · Fri, 22 Dec 2017 11:29:39 -0200



    Environment: PostgreSQL 9.6.6 installed from the yum repository.
      Oracle Linux 7 EL x64. Dell servers with hardware RAID 5.
    

    I was testing our database backup system (based on pgBarman) and
      discovered that one file in the base directory is corrupt on our
      standby database server. The file is fine on the master server,
      but it is 0 bytes in size on the standby server.
    Looking at the master and standby servers, there is no indication
      that the problem exists - replication is running fine.
    

    Evidence:
    On master server:
    [root@server2 1106839]# find 6302536 -exec stat \{\} \;
      File: “6302536”
      Size: 16793600   Blocks: 32800      IO Block: 4096   regular file
    Device: f902h/63746d   Inode: 10618465    Links: 1
    Access: (0600/-rw-------)  Uid: (   26/postgres)   Gid: (   26/postgres)
    Access: 2017-12-08 21:35:38.670841051 -0200
    Modify: 2017-12-21 22:51:40.706074439 -0200
    Change: 2017-12-21 22:51:40.706074439 -0200
     Birth: -
    

    On standby server:
    [root@server3 1106839]# find 6302536 -exec stat \{\} \;
      File: “6302536”
      Size: 0          Blocks: 0          IO Block: 4096   regular empty file
    Device: f901h/63745d   Inode: 391519656   Links: 1
    Access: (0600/-rw-------)  Uid: (   26/postgres)   Gid: (   26/postgres)
    Access: 2017-12-09 15:50:47.469135640 -0200
    Modify: 2017-12-09 15:50:47.469135640 -0200
    Change: 2017-12-09 15:50:47.469135640 -0200
     Birth: -
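
    For anyone who wants to repeat this comparison across a whole
      database directory, here is a minimal Python sketch. It is not
      part of what I actually ran. The function names are made up for
      illustration, and 8192 is assumed to be the block size (the
      PostgreSQL default). A zero-length relation file is not always an
      error (an empty table has one, and a standby can lag slightly
      behind the master), so the result is only a list of candidates
      to inspect.

```python
import os

BLOCK_SIZE = 8192  # assumption: PostgreSQL default block size


def page_count(size_bytes, block_size=BLOCK_SIZE):
    """Number of whole pages stored in a relation file of this size."""
    return size_bytes // block_size


def sizes_in_dir(db_dir):
    """Map each purely numeric file name in db_dir (main forks) to its size."""
    sizes = {}
    for entry in os.scandir(db_dir):
        if entry.is_file() and entry.name.isdigit():
            sizes[entry.name] = entry.stat().st_size
    return sizes


def suspicious_files(master_sizes, standby_sizes):
    """Files that hold data on the master but are empty or missing on the standby."""
    return sorted(
        name
        for name, size in master_sizes.items()
        if size > 0 and standby_sizes.get(name, 0) == 0
    )


if __name__ == "__main__":
    # Sizes taken from the two stat outputs; in practice each dict
    # would come from sizes_in_dir() run on the respective server.
    master = {"6302536": 16793600}
    standby = {"6302536": 0}
    print(page_count(master["6302536"]))      # 2050 pages on the master
    print(suspicious_files(master, standby))  # ['6302536']
```

    With the sizes shown above, the master file holds 2050 pages of
      8192 bytes each, while the standby copy holds none.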
    

    After a long investigation, I discovered that running a query on
      the standby server fails with this error:
    < 2017-12-22 11:20:22.417 -02 > ERROR:  could not read block 0 in file "base/1106839/6302536": read only 0 of 8192 bytes
    < 2017-12-22 11:20:22.417 -02 > STATEMENT:  SELECT *
                FROM MY_FAIR_LARGE_TABLE t1
                LEFT OUTER JOIN MY_FAIR_LARGE_SUBTABLE t0 ON (t0.the_id = t1.ID)
                WHERE (((t1.COMPANY_ID = 2)
                  AND t1.OTHERCOMPANY LIKE '20147617%')
                  AND (t1.TEST_FLAG = 0))
                ORDER BY t1.DUE_DATE LIMIT 1000 OFFSET 0
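
    The path in the error identifies the database OID and the file
      node of the damaged relation. The following is only a sketch of
      how those numbers could be pulled out of a log line and turned
      into lookup queries. The helper names are hypothetical, the
      regular expression assumes the message format shown above, and
      passing 0 as the first argument of pg_filenode_relation() assumes
      the relation lives in the database's default tablespace (which a
      path under base/ suggests).

```python
import re

# Assumed message format, taken from the log line in this post.
ERROR_RE = re.compile(
    r'could not read block (?P<block>\d+) in file '
    r'"base/(?P<db_oid>\d+)/(?P<filenode>\d+)(?:\.(?P<segment>\d+))?"'
)


def parse_read_error(message):
    """Extract block, database OID, filenode and segment number, or None."""
    m = ERROR_RE.search(message)
    if m is None:
        return None
    return {
        "block": int(m.group("block")),
        "db_oid": int(m.group("db_oid")),
        "filenode": int(m.group("filenode")),
        "segment": int(m.group("segment") or 0),  # no suffix means segment 0
    }


def lookup_queries(info):
    """SQL statements that name the database and the relation behind the file."""
    return [
        # Can be run from any database in the cluster.
        "SELECT datname FROM pg_database WHERE oid = %d;" % info["db_oid"],
        # Must be run while connected to the database found above.
        "SELECT pg_filenode_relation(0, %d);" % info["filenode"],
    ]


if __name__ == "__main__":
    line = ('ERROR:  could not read block 0 in file '
            '"base/1106839/6302536": read only 0 of 8192 bytes')
    info = parse_read_error(line)
    for query in lookup_queries(info):
        print(query)
```

    Running the two printed statements with psql should show whether
      the empty file belongs to a table or to one of its indexes.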

    
    The very same query works fine on the master server.
    And there is no replication error - everything is in sync between
      these two servers (I know, I'm beginning to be repetitive).
    I have about 30 servers with the same setup, and only this one has
      this flaw. The only difference is that this database is about 3
      times larger than the others (about 90 GB in size).
    The master and the standby have 23 ms of network latency between
      them, which does not seem to be a problem for the other databases
      on the same server.

    
    Any advice?

    
    -- 
    Edson Carlos Ericksson Richter
    SimKorp Ltda
    Phone: (51) 3366-7964

    "A mind that opens itself to a new idea never returns to its
      original size"
    - Albert Einstein