Re: Trying to recover a corrupted database

John Scalia <jayknowsunix@xxxxxxxxx> · Thu, 17 Jul 2014 15:40:31 -0400



    Well, we already have the schema in a canned file. So, we wouldn't
    need to do a pg_dump for the schema, and I can get into the database
    using psql obviously. That's how I produced the setting from
    pg_settings. But, I really don't want to have to go through 150K+
    records to find the multiple broken ones, if I don't have to do so.
    I'm just hoping for a faster way.

    
    On 7/17/2014 3:30 PM, Scott Whitney
      wrote:

    
        Well, at this point, let me let the big brains answer your
          main question on the thread and tell you what _I_ did when I
          got into a similar situation.
        

        a) pg_dump --schema-only <database name>   >
           db.structure
        b) From here it was a lot of select * from table until I
          found the broken one(s) and I worked around the actual bad
          data by offset/limit on the selects on the busted tables.
        

        It took me several hours, but it did work (for me).
        

        That's assuming you can even psql to the db in the first
          place when the postmaster is running.
        

          We're only working with a single database on this system, and
          yes, I attempted a pg_dump earlier today on just one single
          table - it failed (I know pg_dump and pg_basebackup use
          different mechanisms.) Mind you it's a large table with
          154,000 rows in it. The overall database is somewhere around
          43GB.

          
          On 7/17/2014 3:16 PM, Scott
            Whitney wrote:

          
                a) How many databases do you have?
                b) Have you tried to pg_dump the database(s) to see
                  if that succeeds?
                

                If you _can_ pg_dump (which you might or might not
                  be able to do), you could re-init the cluster and
                  restore.
              
              
              Hi

                all,

                
                You may have seen my post from yesterday about our
                production database getting corrupted. Well, this
                morning we brought the system down to single user and
                ran an fsck which did 

                report some drive errors. We repeated until no
                additional errors were reported. Then, we brought the
                system back to multi-user status and ran a successful
                pg_basebackup on the 

                broken database. Since then we restarted the database
                and a ps -ef result looks like:

                
                /usr/pgsql-9.2/bin/postmaster -D /opt/datacenter -o -c
                zero_damaged_pages=true -i -N 384 -p 5431

                
                After the Db started up, we ran a VACUUM FULL ANALYZE
                which ran for about 3 hours, but the database is still
                showing the same type of errors in its log: invalid page
                header in 

                block 29718... etc. What disturbed me a little, is that
                I don't think the zero_damaged_pages got applied.
                Checking the pg_settings table, we got:

                
                select name, setting, boot_val, reset_val from
                pg_settings where name = 'zero_damaged_pages';

                                   name             |  setting  |
                 boot_val  | reset_val

---------------------------------------------------------------------------------

                  zero_damaged_pages  |  on          | off            
                 | on

                
                Now, my colleague ran this after he tried running some
                operations again after I told him how to set
                zero_damaged_pages again. He swears that that it was on
                when the first VACUUM 

                FULL ANALYZE was run, but I'm not as sure. Plus, I don't
                understand why the boot_val shows as off. In any event,
                as we're still getting log errors like before, I don't
                really know 

                what to try next other than rerunning the VACUUM FULL
                again. Help?

                --

                Jay

                
                -- 

                Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)

                To make changes to your subscription:

                http://www.postgresql.org/mailpref/pgsql-admin