Hi Laurenz, You said use your backup. Wouldn't it be enough to start instance crash recovery just before the corrupted wal in that case? recovery_target_lsn = <<lsn for 00000001000000150000008F>> recovery_target_inclusive = off No need for a backup. Correct? Cheers, Markus > -----Ursprüngliche Nachricht----- > Von: Laurenz Albe <laurenz.albe@xxxxxxxxxxx> > Gesendet: Montag, 8. März 2021 16:28 > An: Daisuke Ikeda <dai.ikd123@xxxxxxxxx>; pgsql-general@xxxxxxxxxxxxxxxxxxxx > Betreff: [Extern] Re: PostgreSQL container crash trouble. > > On Mon, 2021-03-08 at 22:30 +0900, Daisuke Ikeda wrote: > > I'm running PostgreSQL (enabled timescaledb extension) in the following > environment. > > And I had db pod crashed error several times irregularly. > > > > I want to know the cause of this fail. > > > > Environment: > > > > - Base: Kubernetes(microk8s) > > - DB: PostgreSQL 12.4 (TimescaleDB) Container > > - DB OS: Alpine Linux > > - microk8s host OS: CentOS 7.6, Amazon Linux 2 (Occured under some hosts) > > - DB data location: Mounted host directory (for data persistence) > > > > 1) PostgreSQL crash and cannot start Pod.. > > > > I noticed that the timescaledb pod was restarted repeatedly with the following > error. > > > > --- > > PostgreSQL Database directory appears to contain a database; Skipping > > initialization [1] LOG: starting PostgreSQL 12.4 on > > x86_64-pc-linux-musl, compiled by gcc (Alpine 9.3.0) 9.3.0, 64-bit [1] > > LOG: listening on IPv4 address "0.0.0.0", port 5432 [1] LOG: > > listening on IPv6 address "::", port 5432 [1] LOG: listening on Unix socket > "/var/run/postgresql/.s.PGSQL.5432" > > [20] LOG: database system was shut down at 2021-02-15 21:15:12 UTC > > [20] LOG: invalid primary checkpoint record [20] PANIC: could not > > locate a valid checkpoint record [1] LOG: startup process (PID 20) > > was terminated by signal 6: Aborted [1] LOG: aborting startup due to > > startup process failure [1] LOG: database system is shut down > > --- > > > > I thought it might be WAL trouble and tried to check with pg_controldata and > pg_waldump. > > It is WAL trouble. > WAL does not contain the checkpoint from before the crash. > > > I thought it might be WAL trouble and tried to check with pg_controldata and > pg_waldump. > > > > --- > > $ pg_controldata <PostgreSQL data dir> [...] > > Latest checkpoint location: 15/8FB002C0 > > Latest checkpoint's REDO location: 15/8FAF3018 > > Latest checkpoint's REDO WAL file: 00000001000000150000008F > > [...] > > > > I checked the latest checkpoint's REDO WAL file. > > > > --- > > $ pg_waldump -n 10 pg_wal/00000001000000150000008F > > pg_waldump: fatal: WAL segment size must be a power of two between 1 > > MB and 1 GB, but the WAL file "00000001000000150000008F" header > > specifies 0 bytes > > --- > > > > I cannot read wal data. > > This file size is 16MB (according to the wal size setting). > > But the content is all zero data. I checked this situation with "od" command. > > > > --- > > $ od -N 40 -A d -v pg_wal/00000001000000150000008F > > 0000000 000000 000000 000000 000000 000000 000000 000000 000000 > > 0000016 000000 000000 000000 000000 000000 000000 000000 000000 > > 0000032 000000 000000 000000 000000 > > 0000040 > > --- > > Looks like modifications to this file were lost. > > > 2) Run pg_resetwal and successful start Pod > > > > I ran the pg_resetwal command to repair the WAL trouble. > > And the DB pod successfully started. > > Yes, but "pg_resetwal" on a crashed cluster leads to data corruption. > The best you can do now is salvage what you can. > > > But, I received select query fail for some tables. > > > > 3) failed select query > > > > --- > > sampledb1=# select * from table1 limit 1; > > ERROR: missing chunk number 0 for toast value 27678 in pg_toast_2619 > > --- > > > > I thought taht this error is raised for pg_statistic table (and related toast table > (pg_toast_2619)). > > So, I deleted the broken rows in pg_statistic table and ran ANALYZE query. > > After then, this tables is recovered. > > Lucky you! > > > 4) another data trouble > > > > I had an another data trouble after pg_resetwal. > > > > In some table, SELECT query did not return any rows. > > And INSERT query failed (no response. Waiting forever...) This table > > have only primary key index. > > I thought it might have been caused by an pkey index trouble after pg_resetwal. > > I didn't know how to repair the index of the primary key, and finally I dropped > table and restore. > > You should "pg_dumpall" the cluster and load it into a cluster that was newly > created with "initdb". > > Any problems loading the data have to be resolved manually. > > > I thought this wal trouble was caused by disk IO troubles. But any error was not > raised in OS syslog. > > I want to know any other causes. > > One cause might be unreliable storage that doesn't honor fsync requests correctly. > But given that your checkpoint location is pretty late in the file, it seems unlikely > that none of the data did make it to disk. > > Somehow you lost the data for a WAL segment, and you cannot recover. > > Of course, if you have a backup, you know what to do. > > Yours, > Laurenz Albe > -- > Cybertec | https://www.cybertec-postgresql.com > > > > > Achtung: Diese E-Mail wurde von einer externen Adresse verschickt. Klicken Sie > auf keine Links und öffnen Sie keine angehängten Dateien, wenn Sie den Absender > bzw. die Absenderin nicht kennen. Sind Sie sich unsicher, kontaktieren Sie den > Service Desk der Stadt Zürich.