Well, nothing to lose, so I did a touch on pg_clog/0CDE and at least the error changed. Now it says:
DETAIL: Could not read from file "pg_clog/0CDE" at offset 98304: Success.
But I don't understand why it says Success?
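My guess is the read is hitting end-of-file: a zero-byte read doesn't set errno, so the %m in the log message prints "Success". If that's right, a touch isn't enough and the file needs padding out to its full 256 kB so the read at offset 98304 (page 12 of the segment) lands on a real page. Something like this, as a sketch only, assuming the default 8 kB block size and a verified backup in hand:

# Last resort only: pad the clog segment to its full size.
# 32 pages x 8192 bytes = 256 kB per segment. Zeroed status bits read
# back as "transaction in progress", so this fabricates transaction
# state rather than recovering it - copy out what you can first.
dd if=/dev/zero of=pg_clog/0CDE bs=8192 count=32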
On 7/17/2014 4:09 PM, Scott Whitney wrote:
OOhhh... IIRC, I created the clog file as a zero-byte file to get around that. It was _absolutely_ a step of last resort. I would make certain you have a backup that can get you back to at least this point before proceeding.
I was attempting to do that with one table that I know is
broken, but I can't get anything from it in a select. All I
get no matter what is:
ERROR: could not access status of transaction 3706060803
DETAIL: Could not open file "pg_clog/0CDE": No such file or directory
I think the pg_clog files are transaction status files that tell the DB whether a record is or is not visible in the table, but I'm not certain. I just wish there were a pg_resetclog executable like pg_resetxlog is for the pg_xlog segments.
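From what I can find, with the default 8 kB block size each clog segment covers 8192 bytes/page x 4 xids/byte x 32 pages = 1,048,576 transactions, so the segment file for a given xid works out roughly like this (sketch, assumes default BLCKSZ):

# Which pg_clog segment file holds a given transaction id?
# 8192 * 4 * 32 = 1048576 xids per segment, file name is 4 hex digits.
xid_to_clog_segment() {
    printf '%04X\n' $(( $1 / 1048576 ))
}
# usage: xid_to_clog_segment <xid-from-the-ERROR-message>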
On 7/17/2014 3:58 PM, Scott Whitney wrote:
Yeah. I know the feeling, man. What I ended up doing was "select * from" until I found the broken one, then jumped down into groups of thousands of records and basically "Newtoned" it down until I found the affected data.
It sucked. A lot.
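Scripted, the bisection amounts to something like this (database/table names and the row count are placeholders, and it leans on the scan order staying stable between probes):

# Probe ranges with LIMIT/OFFSET until the failing row is pinned down.
lo=0; hi=154000                 # bad row assumed somewhere in [lo, hi)
while [ $((hi - lo)) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    if psql -d mydb -c "SELECT count(*) FROM (SELECT * FROM mytable OFFSET $lo LIMIT $((mid - lo))) s" >/dev/null 2>&1
    then lo=$mid                # first half read cleanly; bad row is higher
    else hi=$mid                # error was in the first half; narrow to it
    fi
done
echo "first bad row is around offset $lo"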
Well, we already have the schema in a canned file, so we wouldn't need to do a pg_dump for the schema, and I can get into the database using psql, obviously. That's how I produced the setting from pg_settings. But I really don't want to go through 150K+ records to find the multiple broken ones if I can avoid it.
I'm just hoping for a faster way.
On 7/17/2014 3:30 PM, Scott Whitney wrote:
Well, at this point, let me let the big brains answer your main question on the thread and tell you what _I_ did when I got into a similar situation.
a) pg_dump --schema-only <database name> > db.structure
b) From here it was a lot of select * from table until I found the broken one(s), and I worked around the actual bad data by offset/limit on the selects on the busted tables.
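Concretely, once a bad span was located, pulling the good rows out looked roughly like this (table name and offsets are made up for illustration):

# Salvage the readable ranges on either side of a bad row at offset 98123.
psql -d mydb -c "COPY (SELECT * FROM mytable LIMIT 98123) TO STDOUT" > mytable.part1
psql -d mydb -c "COPY (SELECT * FROM mytable OFFSET 98124) TO STDOUT" > mytable.part2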
It took me several hours, but it did work (for
me).
That's assuming you can even psql to the db in
the first place when the postmaster is running.
We're only working with a single database on this system, and yes, I attempted a pg_dump earlier today on just one single table - it failed (I know pg_dump and pg_basebackup use different mechanisms). Mind you, it's a large table with 154,000 rows in it. The overall database is somewhere around 43GB.
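(For reference, the attempt was just the straightforward single-table form - table and database names are stand-ins here:

pg_dump -t mytable mydb > mytable.sql

It fails the same way once it has to read a damaged page.)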
On 7/17/2014 3:16 PM, Scott Whitney wrote:
a) How many databases do you have?
b) Have you tried to pg_dump the database(s) to see if that succeeds?
If you _can_ pg_dump (which you might or might not be able to do), you could re-init the cluster and restore.
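The re-init path, roughly (the -D path and port are from your ps output; the rest is a sketch, not a script to paste):

pg_dumpall -p 5431 > all.sql              # 1. dump everything while you still can
pg_ctl -D /opt/datacenter stop            # 2. stop the broken cluster
mv /opt/datacenter /opt/datacenter.bad    # 3. keep the old files around
initdb -D /opt/datacenter                 # 4. fresh cluster (re-add your config)
pg_ctl -D /opt/datacenter start
psql -f all.sql postgres                  # 5. restore (default port unless reconfigured)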
Hi all,
You may have seen my post from yesterday about our production database getting corrupted. Well, this morning we brought the system down to single-user mode and ran an fsck, which did report some drive errors. We repeated until no additional errors were reported. Then we brought the system back to multi-user status and ran a successful pg_basebackup on the broken database. Since then we restarted the database, and a ps -ef result looks like:
/usr/pgsql-9.2/bin/postmaster -D /opt/datacenter -o -c zero_damaged_pages=true -i -N 384 -p 5431
After the DB started up, we ran a VACUUM FULL ANALYZE, which ran for about 3 hours, but the database is still showing the same type of errors in its log: invalid page header in block 29718... etc. What disturbed me a little is that I don't think the zero_damaged_pages got applied. Checking the pg_settings table, we got:
select name, setting, boot_val, reset_val from pg_settings where name = 'zero_damaged_pages';

        name        | setting | boot_val | reset_val
--------------------+---------+----------+-----------
 zero_damaged_pages | on      | off      | on
Now, my colleague ran this after he tried running some operations again, after I told him how to set zero_damaged_pages again. He swears that it was on when the first VACUUM FULL ANALYZE was run, but I'm not as sure. Plus, I don't understand why the boot_val shows as off. In any event, as we're still getting log errors like before, I don't really know what to try next other than rerunning the VACUUM FULL again. Help?
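If boot_val is just the compiled-in default, the off there may be a red herring. The retry we're planning is to set it explicitly in the same session that runs the vacuum, so there's no question whether it's live (database name is a placeholder):

psql -p 5431 -d mydb <<'SQL'
SET zero_damaged_pages = on;
SHOW zero_damaged_pages;   -- confirm it took before the long run
VACUUM FULL ANALYZE;
SQL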
--
Jay