I was attempting to do that with one table that I know is broken,
but I can't get anything from it in a select. All I get no matter
what is:
ERROR: could not access status of transaction 3706060803
DETAIL: Could not open file "pg_clog/0DCE": No such file or directory
I think the pg_clog files record transaction commit status (whether each
transaction committed or aborted), which is what tells the Db whether a
record is or is not visible in the table, but I'm not certain. I just wish
there were a pg_resetclog executable like pg_resetxlog is for the
pg_xlog segments.
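For what it's worth, the folklore workaround for a missing clog segment (risky, and a sketch rather than anything official) is to create the named segment as a zero-filled 256 KB file so the status lookup stops failing and the affected transactions read as uncommitted. The demo below writes to a throwaway directory and a placeholder segment name; a real repair would target $PGDATA/pg_clog, use the segment name from the error, and start from a full backup:

```shell
# Create a zero-filled 256 KB clog segment (demo path and segment name
# "0000" are placeholders; clog SLRU segments are 262144 bytes each).
CLOG_DEMO=/tmp/pg_clog_demo
mkdir -p "$CLOG_DEMO"
dd if=/dev/zero of="$CLOG_DEMO/0000" bs=262144 count=1 2>/dev/null
wc -c < "$CLOG_DEMO/0000"
```

Any rows "recovered" this way should be treated as suspect; the zeroed file only lets the scan proceed, it does not restore the lost commit status.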
On 7/17/2014 3:58 PM, Scott Whitney wrote:
Yeah. I know the feeling, man. What I ended up doing was
"select * from" until I found the broken one then jumped down
into groups of thousands of records and basically "Newtoned"
it down until I found the affected data.
It sucked. A lot.
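That narrowing-down could be sketched as a binary search driven from the shell. The table name and row count below are assumptions from elsewhere in the thread, and this demo just prints the SELECT to try at each step; in practice you would run each one through psql and halve toward whichever side errors:

```shell
# Binary search for a corrupt row using OFFSET/LIMIT (sketch).
# If the printed SELECT errors, the bad row is in [lo, mid): set hi=$mid.
# If it succeeds, the bad row is in [mid, hi): set lo=$mid.
# This demo always narrows to the lower half so it terminates on its own.
lo=0; hi=154000
while [ $((hi - lo)) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    echo "try: SELECT * FROM broken_table OFFSET $lo LIMIT $((mid - lo));"
    hi=$mid
done
echo "suspect offset: $lo"
```

One caveat: without an ORDER BY, OFFSET/LIMIT row order is not guaranteed stable, so it is worth re-running the final narrow SELECT a couple of times to confirm the same rows fail.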
Well, we already have the schema in a canned file. So, we
wouldn't need to do a pg_dump for the schema, and I can get
into the database using psql obviously. That's how I produced
the setting from pg_settings. But, I really don't want to have
to go through 150K+ records to find the multiple broken ones,
if I don't have to do so. I'm just hoping for a faster way.
On 7/17/2014 3:30 PM, Scott Whitney wrote:
Well, at this point, let me let the big brains answer
your main question on the thread and tell you what _I_
did when I got into a similar situation.
a) pg_dump --schema-only <database name> > db.structure
b) From here it was a lot of select * from table
until I found the broken one(s) and I worked around the
actual bad data by offset/limit on the selects on the
busted tables.
It took me several hours, but it did work (for me).
That's assuming you can even psql to the db in the
first place when the postmaster is running.
We're only working with a single database on this
system, and yes, I attempted a pg_dump earlier today on
just one single table, and it failed. (I know pg_dump and
pg_basebackup use different mechanisms.) Mind you, it's a
large table with 154,000 rows in it. The overall
database is somewhere around 43GB.
On 7/17/2014 3:16 PM, Scott Whitney wrote:
a) How many databases do you have?
b) Have you tried to pg_dump the database(s)
to see if that succeeds?
If you _can_ pg_dump (which you might or
might not be able to do), you could re-init the
cluster and restore.
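If the dump does succeed, the re-init cycle might look like the sketch below. The database name, data directory, and port are assumptions pulled from elsewhere in the thread, and RUN defaults to echo so the script prints the plan instead of executing anything:

```shell
# Dump / re-init / restore sketch. RUN=echo by default: each step is
# printed, not run. Set RUN= (empty) only after backing everything up.
RUN=${RUN:-echo}
$RUN pg_dump -p 5431 mydb -f mydb.sql      # works only if the data is readable
$RUN pg_ctl -D /opt/datacenter stop -m fast
$RUN mv /opt/datacenter /opt/datacenter.broken
$RUN initdb -D /opt/datacenter
$RUN pg_ctl -D /opt/datacenter start -w
$RUN createdb -p 5431 mydb
$RUN psql -p 5431 -d mydb -f mydb.sql
```

Keeping the old directory as /opt/datacenter.broken (rather than deleting it) leaves the door open for further salvage if the restore turns out incomplete.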
Hi all,
You may have seen my post from yesterday about our
production database getting corrupted. Well, this
morning we brought the system down to single user
and ran an fsck, which did
report some drive errors. We repeated the fsck until no
additional errors were reported. Then, we brought
the system back to multi-user status and ran a
successful pg_basebackup on the
broken database. Since then we restarted the
database and a ps -ef result looks like:
/usr/pgsql-9.2/bin/postmaster -D /opt/datacenter
-o -c zero_damaged_pages=true -i -N 384 -p 5431
After the Db started up, we ran a VACUUM FULL
ANALYZE which ran for about 3 hours, but the
database is still showing the same type of errors
in its log: invalid page header in
block 29718... etc. What disturbed me a little is
that I don't think the zero_damaged_pages got
applied. Checking the pg_settings table, we got:
select name, setting, boot_val, reset_val from
pg_settings where name = 'zero_damaged_pages';
        name        | setting | boot_val | reset_val
--------------------+---------+----------+-----------
 zero_damaged_pages | on      | off      | on
Now, my colleague ran this after he tried running
some operations again after I told him how to set
zero_damaged_pages again. He swears that it
was on when the first VACUUM
FULL ANALYZE was run, but I'm not as sure. Plus, I
don't understand why the boot_val shows as off. In
any event, as we're still getting log errors like
before, I don't really know
what to try next other than rerunning the VACUUM
FULL. Help?
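A note on the output above: pg_settings.boot_val is just the compiled-in default, so "off" there is expected; the "setting" column is what is currently in effect. Since zero_damaged_pages only zeroes pages in backends that have it on at the moment they read the damaged pages, running SHOW zero_damaged_pages in the same psql session as the VACUUM is the real check. It can also be pinned in postgresql.conf instead of the -o switch, as a sketch:

```
# postgresql.conf fragment, equivalent to -c zero_damaged_pages=true on
# the postmaster command line; requires a reload or restart to take effect.
zero_damaged_pages = on
```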
--
Jay
--
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin