
Re: PostgreSQL container crash trouble.

On Mon, 2021-03-08 at 22:30 +0900, Daisuke Ikeda wrote:
> I'm running PostgreSQL (with the timescaledb extension enabled) in the following environment,
> and the db pod has crashed several times at irregular intervals.
> 
> I want to know the cause of this failure.
> 
> Environment:
> 
>   - Base: Kubernetes(microk8s)
>   - DB: PostgreSQL 12.4 (TimescaleDB) Container
>   - DB OS: Alpine Linux
>   - microk8s host OS: CentOS 7.6, Amazon Linux 2 (the crash occurred on multiple hosts)
>   - DB data location: Mounted host directory (for data persistence)
> 
> 1) PostgreSQL crashed and the Pod cannot start
> 
> I noticed that the timescaledb pod was restarted repeatedly with the following error.
> 
> ---
> PostgreSQL Database directory appears to contain a database; Skipping initialization
> [1] LOG:  starting PostgreSQL 12.4 on x86_64-pc-linux-musl, compiled by gcc (Alpine 9.3.0) 9.3.0, 64-bit
> [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
> [1] LOG:  listening on IPv6 address "::", port 5432
> [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
> [20] LOG:  database system was shut down at 2021-02-15 21:15:12 UTC
> [20] LOG:  invalid primary checkpoint record
> [20] PANIC:  could not locate a valid checkpoint record
> [1] LOG:  startup process (PID 20) was terminated by signal 6: Aborted
> [1] LOG:  aborting startup due to startup process failure
> [1] LOG:  database system is shut down
> ---
> 
> I thought it might be WAL trouble and tried to check with pg_controldata and pg_waldump.

It is WAL trouble.
WAL does not contain the checkpoint from before the crash.

> ---
> $ pg_controldata <PostgreSQL data dir>
> [...]
> Latest checkpoint location:           15/8FB002C0
> Latest checkpoint's REDO location:    15/8FAF3018
> Latest checkpoint's REDO WAL file:    00000001000000150000008F
> [...]
>
> I checked the latest checkpoint's REDO WAL file.
> 
> ---
> $ pg_waldump -n 10 pg_wal/00000001000000150000008F 
> pg_waldump: fatal: WAL segment size must be a power of two between 1 MB and 1 GB, but the WAL file "00000001000000150000008F" header specifies 0 bytes
> ---
> 
> I cannot read the WAL data.
> The file size is 16MB (matching the configured WAL segment size),
> but the content is all zeros, which I confirmed with the "od" command.
> 
> ---
> $ od -N 40 -A d -v pg_wal/00000001000000150000008F
> 0000000 000000 000000 000000 000000 000000 000000 000000 000000
> 0000016 000000 000000 000000 000000 000000 000000 000000 000000
> 0000032 000000 000000 000000 000000
> 0000040
> ---

Looks like modifications to this file were lost.
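
If you want to verify that the entire 16MB segment is zero-filled (not just
the first 40 bytes), you can compare it against /dev/zero; a minimal sketch,
assuming GNU cmp is available in the container:

---
$ cmp -n $((16*1024*1024)) pg_wal/00000001000000150000008F /dev/zero \
      && echo "segment is entirely zeros"
---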

> 2) Ran pg_resetwal and the Pod started successfully
> 
> I ran the pg_resetwal command to repair the WAL trouble,
> and the DB pod started successfully.

Yes, but "pg_resetwal" on a crashed cluster leads to data corruption.
The best you can do now is salvage what you can.
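
For reference, pg_resetwal refuses to run on a cluster that did not shut
down cleanly unless it is forced, so the invocation was presumably
something like this (the data directory path is hypothetical):

---
$ pg_resetwal --force /var/lib/postgresql/data
---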

> But then SELECT queries failed for some tables.
> 
> 3) Failed SELECT query
> 
> ---
> sampledb1=# select * from table1 limit 1;
> ERROR:  missing chunk number 0 for toast value 27678 in pg_toast_2619
> ---
> 
> I figured that this error was raised by the pg_statistic table (and its associated TOAST table, pg_toast_2619).
> So I deleted the broken rows in the pg_statistic table and ran ANALYZE.
> After that, the table was recovered.

Lucky you!
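
That was the right call: the contents of pg_statistic are expendable,
since ANALYZE rebuilds them from scratch.  A minimal sketch of such a
repair (requires superuser; deleting all rows is the blunt variant for
when the broken rows are hard to isolate):

---
sampledb1=# DELETE FROM pg_statistic;  -- statistics are regenerated below
sampledb1=# ANALYZE;
---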

> 4) Another data problem
> 
> I had another data problem after pg_resetwal.
>
> In one table, SELECT queries did not return any rows,
> and INSERT queries failed (no response, waiting forever...).
> This table has only a primary key index.
> I thought it might have been caused by primary key index corruption after pg_resetwal.
> I didn't know how to repair the primary key index, so in the end I dropped the table and restored it.

You should "pg_dumpall" the cluster and load it into a cluster
that was newly created with "initdb".

Any problems loading the data have to be resolved manually.
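
A minimal sketch of that procedure, with hypothetical paths and assuming
the damaged cluster is still running on port 5432:

---
$ pg_dumpall --port=5432 > dump.sql            # dump the damaged cluster
$ initdb -D /var/lib/postgresql/newdata        # create a fresh cluster
$ pg_ctl -D /var/lib/postgresql/newdata -o "--port=5433" start
$ psql --port=5433 -f dump.sql postgres        # load the dump
---

Tables with unreadable rows will make pg_dumpall fail; those have to be
repaired or excluded before the dump succeeds.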

> I suspected that this WAL trouble was caused by disk I/O problems, but no errors showed up in the OS syslog.
> I want to know what other causes there might be.

One cause might be unreliable storage that doesn't honor fsync requests
correctly.  But given that your checkpoint location is pretty late in
the file, it seems unlikely that none of the data made it to disk.
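
If you want to sanity-check the fsync behavior of the mounted volume,
PostgreSQL ships pg_test_fsync; a quick check against a hypothetical
path on that volume:

---
$ pg_test_fsync -f /var/lib/postgresql/data/fsync-testfile
---

Implausibly high operations-per-second figures there would suggest that
the storage acknowledges syncs without actually flushing them.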

Somehow you lost the data for a WAL segment, and you cannot recover.

Of course, if you have a backup, you know what to do.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com