Search Postgresql Archives

Re: Issue with WAL logs temporary not replaying

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 01/31/2018 09:56 AM, Eugene Pirogov wrote:
Hi,

We've experienced a DB incident related to WAL. The issue is now somewhat resolved, and I'm trying to understand what could have caused the problem to occur in the first place. Below is a full chronicle of an incident we've ran into.

At some point we realized almost no new data was coming in the DB during the day. We quickly looked and realized the following:

  * our application was still writing data, unaware that there any
    issues on the DB end, like if nothing happened at all,
  * the database was accepting connections (we were able to connect and
    query),
  * the database was writing regular text logs,
  * the database was not writing new data to the tables,
  * the logical DB replica actually had the expected records replicated
    and present in the replica tables,
  * so we made assumption original source WAL data existed and was
    correct (given the fact logical DB is relying on WAL).

We tried to restart the DB, but it would not come up. Instead, a few core dumps in the postgres data folder were generated during our attempts.

This is what we saw in the postgres logs:

    LOG:  invalid resource manager ID 112 at 3/89005B10
    LOG:  invalid primary checkpoint record
    LOG:  invalid resource manager ID 138 at 3/89005208
    LOG:  invalid secondary checkpoint record
    PANIC:  could not locate a valid checkpoint record


To aid this we used pg_resetxlog. After running the pg_resetxlog command, the DB successfully started and slowly began recovering the data from WAL, automatically. During the recovery process, we ran VACUUM ANALYSE couple of time, which seems to have somehow aided the speed of recovery process (though I could have been a confirmation bias).

The DB was fully recovered in 30 minutes.

Upon closer inspection, however, we realized we've lost two records from a particular table. The records were replicated to logical replica though, and further inspection showed their insertion timestamp sometime during the day, long after the database supposedly stopped applying new data from WAL to tables.

We don't know the reason why these two records were gone (no DELETE statements ever happen in the table in question). The records in question existed in replica table, luckly.

The logs for the entire day when the incident happened are available here <https://gist.githubusercontent.com/gmile/6df9ce393b11fb369856b84c4e9f755a/raw/5ab86c10332f9aef01abf16cb4989334bc7b02a2/postgresql.log> (1.6 Mb). Each log line was prepended with a timestamp. Also, we have the core dumps that were generated during attempts to restart the database.

What the role of the user 'db'?

Any idea what is causing this?:

WARNING:  concurrent insert in progress within table "audit_log"
...repeated... many times

WARNING: concurrent insert in progress within table "declarations_status_hstr"
...repeated many times

Are either of these tables the one that had the two missing records?


My questions are:

  * what would be some typical reasons for primary/secondary WAL
    checkpoints to fail? Disk issues? File system issues? OS issues? I'm
    running out of ideas here.

    Is there any way I could separate primary and secondary checkpoints
    in space, so to increase postgres's chances to restore one of them?

  * how come postgresql would stop applying WAL logs to tables, and yet
    keep silence? There was no indication WAL is not being applied,
    postgres log would be silent (even though it kept writing other
    commodity messages, such as "incomplete startup packet" and
    malformed query attempts).

    Due to this, we noticed the issue only hours after postgres
    supposedly stopped applying the WAL logs to tables.

  * Is there any other log (like some lower level system log?) where I
    could see the status of writing / reading / applying WAL segments?

  * What measures could be done to prevent this in future, or at least
    make the situation easier to recover from (e.g. without data loss)?

Thank you in advance.

--
Eugene

http://www.gmile.me


--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx




[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]

  Powered by Linux