Dear all.
I'm running PostgreSQL (with the timescaledb extension enabled) in the environment below.
The DB pod has crashed several times at irregular intervals.
I would like to find out the cause of these failures.
Environment:
- Base: Kubernetes (microk8s)
- DB: PostgreSQL 12.4 (TimescaleDB) Container
- DB OS: Alpine Linux
- microk8s host OS: CentOS 7.6, Amazon Linux 2 (the crash occurred on multiple hosts)
- DB data location: Mounted host directory (for data persistence)
1) PostgreSQL crashed and the Pod cannot start
I noticed that the timescaledb pod was restarting repeatedly with the following error:
---
PostgreSQL Database directory appears to contain a database; Skipping initialization
[1] LOG: starting PostgreSQL 12.4 on x86_64-pc-linux-musl, compiled by gcc (Alpine 9.3.0) 9.3.0, 64-bit
[1] LOG: listening on IPv4 address "0.0.0.0", port 5432
[1] LOG: listening on IPv6 address "::", port 5432
[1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
[20] LOG: database system was shut down at 2021-02-15 21:15:12 UTC
[20] LOG: invalid primary checkpoint record
[20] PANIC: could not locate a valid checkpoint record
[1] LOG: startup process (PID 20) was terminated by signal 6: Aborted
[1] LOG: aborting startup due to startup process failure
[1] LOG: database system is shut down
---
I suspected WAL trouble and checked with pg_controldata and pg_waldump.
---
$ pg_controldata <PostgreSQL data dir>
pg_control version number: 1201
Catalog version number: 201909212
Database system identifier: 6909006008117780509
Database cluster state: in production
pg_control last modified: Mon Feb 15 21:11:21 2021
Latest checkpoint location: 15/8FB002C0
Latest checkpoint's REDO location: 15/8FAF3018
Latest checkpoint's REDO WAL file: 00000001000000150000008F
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:33988723
Latest checkpoint's NextOID: 117637
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 480
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 33988723
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint: Mon Feb 15 21:11:14 2021
Fake LSN counter for unlogged rels: 0/3E8
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: off
max_connections setting: 50
max_worker_processes setting: 15
max_wal_senders setting: 10
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0
Mock authentication nonce: 871e705d9393cdb1b161d5562656e20db314779af25942fa10d840983dc4ba84
---
I checked the latest checkpoint's REDO WAL file.
---
$ pg_waldump -n 10 pg_wal/00000001000000150000008F
pg_waldump: fatal: WAL segment size must be a power of two between 1 MB and 1 GB, but the WAL file "00000001000000150000008F" header specifies 0 bytes
---
I could not read the WAL data.
The file size is 16MB (matching the WAL segment size setting), but the content is all zeros.
I confirmed this with the "od" command:
---
$ od -N 40 -A d -v pg_wal/00000001000000150000008F
0000000 000000 000000 000000 000000 000000 000000 000000 000000
0000016 000000 000000 000000 000000 000000 000000 000000 000000
0000032 000000 000000 000000 000000
0000040
---
The previous WAL segment could be read with pg_waldump.
However, I could not find any record corresponding to the latest checkpoint location (15/8FB002C0) reported by pg_controldata.
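(For reference, I dumped the previous segment roughly like this; the exact options may have differed slightly.)
---
$ pg_waldump -n 10 pg_wal/00000001000000150000008E
---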
2) Running pg_resetwal allowed the Pod to start
I ran pg_resetwal to get past the WAL trouble.
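The invocation was roughly the following (run with the server stopped; I believe -f was needed because the cluster had not been shut down cleanly):
---
$ pg_resetwal -f <PostgreSQL data dir>
---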
After that, the DB pod started successfully.
However, SELECT queries on some tables failed.
3) Failed SELECT query
---
sampledb1=# select * from table1 limit 1;
ERROR: missing chunk number 0 for toast value 27678 in pg_toast_2619
---
I determined that this error was raised for the pg_statistic table (and its TOAST table, pg_toast_2619).
So I deleted the broken rows from pg_statistic and ran ANALYZE.
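The cleanup was roughly like this (table1 stands for the affected table; in practice I deleted only the rows that triggered the TOAST error):
---
sampledb1=# DELETE FROM pg_statistic WHERE starelid = 'table1'::regclass;
sampledb1=# ANALYZE table1;
---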
After that, the tables worked again.
4) Another data problem
I ran into another data problem after pg_resetwal.
On one table, SELECT queries returned no rows.
INSERT queries also failed (no response; they waited forever).
This table has only a primary key index.
I suspect the primary key index was corrupted after pg_resetwal.
I did not know how to repair the primary key index, so in the end I dropped the table and restored it.
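(Looking back, a reindex might have been an option, e.g. the following, but I did not try it at the time; <broken table> is just a placeholder.)
---
sampledb1=# REINDEX TABLE <broken table>;
---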
I suspected that this WAL trouble was caused by disk I/O problems, but no errors appeared in the OS syslog.
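(On the host I checked roughly like this; /var/log/messages is the CentOS 7 default location and may differ on Amazon Linux 2.)
---
$ dmesg | grep -i error
$ sudo grep -i error /var/log/messages
---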
I would like to know what other causes could explain this.
Best regards.