Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100

Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx> · Wed, 7 Jun 2017 18:01:49 +0300



    On 07/06/2017 17:49, Harry Ambrose
      wrote:

    
      Hi,

      Out of interest, are you using any tablespaces other
        than pg_default? I can only replicate the issue when using
        separately mounted tablespaces.
    
    One lesson I learned from the BSD camp when dealing with random
    freezes and panics : when all else fails to give an answer it is
    time to start blaming my hardware. Are those tablespaces on any
    cheap SSD's ? 

    
      I have been investigating this quite extensively and
        everything I can find on the web suggests data corruption.
        However running the the following DO reports no errors and I can
        dump the database without issue.
      

    You don't use index when pg_dump . If only the index is corrupted
    you can get away with dump/reload (but for big DBs this is
    unrealistic)

    
      I also found the following has been reported: https://www.postgresql.org/message-id/20161201165505.4360.28203@xxxxxxxxxxxxxxxxxxxxxxx
      

      Best wishes,
      Harry
      

          On 7 Jun 2017, at 15:22, Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx>
            wrote:
          

          On 07/06/2017 16:33, ADSJ
              (Adam Sjøgren) wrote:

            Our database has
              started reporting errors like this:

              
                2017-05-31 13:48:10 CEST ERROR:  unexpected chunk number
              0 (expected 1) for toast value 14242189 in
              pg_toast_10919630

                    ...

                2017-06-01 11:06:56 CEST ERROR:  unexpected chunk number
              0 (expected 1) for toast value 19573520 in
              pg_toast_10919630

              
              (157 times, for different toast values, same
              pg_toast_nnn). pg_toast_10919630

              corresponds to a table with around 168 million rows.

              
              These went away, but the next day we got similar errors
              from another

              table:

              
                2017-06-02 05:59:50 CEST ERROR:  unexpected chunk number
              0 (expected 1) for toast value 47060150 in
              pg_toast_10920100

                    ...

                2017-06-02 06:14:54 CEST ERROR:  unexpected chunk number
              0 (expected 1) for toast value 47226455 in
              pg_toast_10920100

              
              (Only 4 this time) pg_toast_10920100 corresponds to a
              table with holds

              around 320 million rows (these are our two large tables).

              
              The next day we got 6 such errors and the day after 10
              such errors. On

              June 5th we got 94, yesterday we got 111, of which one
              looked a little

              different:

              
                2017-06-06 17:32:21 CEST ERROR:  unexpected chunk size
              1996 (expected 1585) in final chunk 0 for toast value
              114925100 in pg_toast_10920100

              
              and today the logs have 65 lines, ending with these:

              
                2017-06-07 14:49:53 CEST ERROR:  unexpected chunk number
              0 (expected 1) for toast value 131114834 in
              pg_toast_10920100

                2017-06-07 14:53:41 CEST ERROR:  unexpected chunk number
              0 (expected 1) for toast value 131149566 in
              pg_toast_10920100

            
            First try to find which
              tables those toast relations refer to :

            select 10919630::regclass ,
              10920100::regclass ;

            Are those critical tables?
              Can you restore them somehow?

            
            Also you may consider

            REINDEX TABLE
              pg_toast.pg_toast_10920100;

            REINDEX TABLE
              pg_toast.pg_toast_10919630;

            REINDEX TABLE <name of
              table 10920100>;

            REINDEX TABLE <name of
              table 10919630>;

            
            also VACUUM the above tables.

            
            You might want to write a
              function which iterates over the damaged table's rows in
              order to identify the damaged row(s). And then do some
              good update to create a new version.

            
            The database is
              10 TB on disk (SSDs) and runs on a 48 core server with 3

              TB RAM on Ubuntu 14.04 (Linux 3.18.13).

              
              We are updating rows in the database a lot/continuously.

              
              There are no apparent indications of hardware errors (like
              ECC) in

              dmesg, nor any error messages logged by the LSI MegaRAID
              controller, as

              far as I can tell.

              
              We are running PostgreSQL 9.3.14 currently.

              
              The only thing I could see in the release notes since
              9.3.14 that might

              be related is this:

              
               "* Avoid very-low-probability data corruption due to
              testing tuple

                  visibility without holding buffer lock (Thomas Munro,
              Peter Geoghegan,

                  Tom Lane)"

              
              Although reading more about it, it doesn't sound like it
              would exhibit

              the symptoms we see?

              
              We have recently increased the load (to around twice the
              number of

              cores), though, which made me think we could be triggering
              corner cases

              we haven't hit before.

              
              We will be upgrading to PostgreSQL 9.3.17 during the
              weekend, but I'd like to hear

              if anyone has seen something like this, or have some ideas
              of how to

              investigate/what the cause might be.

              
                Best regards,

              
                  Adam

              
            -- 

            Achilleas Mantzios

            IT DEV Lead

            IT DEPT

            Dynacom Tankers Mgmt

            
            -- 

            Sent via pgsql-general
              mailing list (pgsql-general@xxxxxxxxxxxxxx)

            To make changes to your
              subscription:

            http://www.postgresql.org/mailpref/pgsql-general
        
      
    -- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt