Hi, unfortunately your message is missing some pretty important information, like for example which OS you're running, what PostgreSQL version are you using (and if you built it on your own or installed from some package repository). Also, you mention replicas. What type of replication are you using? The built-in streaming replication, or something else? On 01/31/2018 06:56 PM, Eugene Pirogov wrote: > Hi, > > We've experienced a DB incident related to WAL. The issue is now > somewhat resolved, and I'm trying to understand what could have caused > the problem to occur in the first place. Below is a full chronicle of an > incident we've ran into. > > At some point we realized almost no new data was coming in the DB during > the day. We quickly looked and realized the following: > > * our application was still writing data, unaware that there any > issues on the DB end, like if nothing happened at all, > * the database was accepting connections (we were able to connect and > query), > * the database was writing regular text logs, > * the database was not writing new data to the tables, > * the logical DB replica actually had the expected records replicated > and present in the replica tables, > * so we made assumption original source WAL data existed and was > correct (given the fact logical DB is relying on WAL). > > We tried to restart the DB, but it would not come up. Instead, a few > core dumps in the postgres data folder were generated during our attempts. > Please inspect the core dumps using gdb and share the backtraces. There are instructions how to do that on the wiki: https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD > This is what we saw in the postgres logs: > > LOG: invalid resource manager ID 112 at 3/89005B10 > LOG: invalid primary checkpoint record > LOG: invalid resource manager ID 138 at 3/89005208 > LOG: invalid secondary checkpoint record > PANIC: could not locate a valid checkpoint record > > > To aid this we used pg_resetxlog. After running the pg_resetxlog > command, the DB successfully started and slowly began recovering the > data from WAL, automatically. During the recovery process, we ran VACUUM > ANALYSE couple of time, which seems to have somehow aided the speed of > recovery process (though I could have been a confirmation bias). > Firstly, I very much doubt using pg_resetxlog is a good idea, especially when it's rather unclear what was the actual root cause. Secondly, I don't quite understand what you mean by "recovery" here. The regular recovery process (say, after a crash) makes it impossible to connect to the database, so I'm not sure what you mean by running vacuum analyze (not to mention how it could have improved the recovery speed). > The DB was fully recovered in 30 minutes. > > Upon closer inspection, however, we realized we've lost two records from > a particular table. The records were replicated to logical replica > though, and further inspection showed their insertion timestamp sometime > during the day, long after the database supposedly stopped applying new > data from WAL to tables. > > We don't know the reason why these two records were gone (no DELETE > statements ever happen in the table in question). The records in > question existed in replica table, luckly. > Well, considering you did pg_resetxlog ... I'm surprised you only lost two records, but it's a question what state are the indexes in etc. > The logs for the entire day when the incident happened are available > here > <https://gist.githubusercontent.com/gmile/6df9ce393b11fb369856b84c4e9f755a/raw/5ab86c10332f9aef01abf16cb4989334bc7b02a2/postgresql.log> (1.6 > Mb). Each log line was prepended with a timestamp. Also, we have the > core dumps that were generated during attempts to restart the database. > Man, that's an extremely noisy log file. There are errors caused by broken pg_hba.conf entries and so on, which makes it extremely difficult to look for possible other errors ... I'm not going to waste time by doing that. > My questions are: > > * what would be some typical reasons for primary/secondary WAL > checkpoints to fail? Disk issues? File system issues? OS issues? I'm > running out of ideas here. > > Is there any way I could separate primary and secondary checkpoints > in space, so to increase postgres's chances to restore one of them? I don't know. You have not provided any information about the storage system, and we don't have access to system logs that might show if something happened at this level. But in general yes, a storage issue might cause this. > > * how come postgresql would stop applying WAL logs to tables, and yet > keep silence? There was no indication WAL is not being applied, > postgres log would be silent (even though it kept writing other > commodity messages, such as "incomplete startup packet" and > malformed query attempts). > > Due to this, we noticed the issue only hours after postgres > supposedly stopped applying the WAL logs to tables. > It wouldn't. But it's unclear how you came to that conclusion. > * Is there any other log (like some lower level system log?) where I > could see the status of writing / reading / applying WAL segments? > There's only one log for PostgreSQL, so it would have to be in that file depending on the log level. > * What measures could be done to prevent this in future, or at least > make the situation easier to recover from (e.g. without data loss)? > The jury is still out on the root cause, so it's too soon to say. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services