Hi guys
The servers are virtual machines running on Nutanix.
We are running PostgreSQL 12 (12.10)
on Linux km-data1.rs.fsbtech.com 5.4.191-1.el7.elrepo.x86_64 #1 SMP Tue Apr 26 12:14:16 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
48 GB / 16 x CPU (master and slave)
Timeline
The system had a number of issues due to Kafka and slots, and would not always shut down cleanly. There were several incidents where the shutdown had to be forced.
We found corruption in a number of indexes and tables and decided to recover from a backup (Barman). Due to a missing WAL file, the data was restored
only up to a point about three hours earlier than we expected, but it is working.
Attempts to build a slave:
On the slave, at first we got an error message from pg_basebackup stating that the target directory was not empty:
pg_basebackup: error: directory "/var/lib/pgsql/12/data" exists but is not empty
This happened although the directory /var/lib/pgsql/12/data had been emptied (using rm -r ....). Finally we deleted the data directory and re-created it, ensuring that
the permissions were the same. The restore then started successfully and completed with error=0.
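One possible explanation for the "not empty" error, which we have not confirmed, is that hidden entries (dotfiles) were left behind. If the rm -r was run against a glob such as data/*, it would not match them, and as far as I know pg_basebackup treats a directory that contains only dotfiles as not empty. A minimal shell sketch for checking this is below; the function name dir_is_empty is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of PostgreSQL): succeeds only if the
# directory exists and has no entries at all, hidden ones included.
dir_is_empty() {
    dir=$1
    [ -d "$dir" ] || return 2          # not a directory
    # ls -A lists dotfiles too, but omits "." and ".."
    [ -z "$(ls -A -- "$dir")" ]
}

# Example (path taken from the error message above):
# dir_is_empty /var/lib/pgsql/12/data && echo "empty" || echo "not empty"
```

Deleting and re-creating the directory, as we ended up doing, removes any such hidden entries as a side effect.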
When starting the instance with systemctl start postgresql-12.service, we got the message:
Job for postgresql-12.service failed because the control process exited with error code. See "systemctl status postgresql-12.service" and "journalctl -xe" for details.
Running systemctl status postgresql-12.service returned:
postgresql-12.service - PostgreSQL 12 database server
Loaded: loaded (/usr/lib/systemd/system/postgresql-12.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Mon 2023-10-02 15:26:43 BST; 54s ago
Docs: https://www.postgresql.org/docs/12/static/
Process: 9532 ExecStart=/usr/pgsql-12/bin/postmaster -D ${PGDATA} (code=exited, status=1/FAILURE)
Process: 9526 ExecStartPre=/usr/pgsql-12/bin/postgresql-12-check-db-dir ${PGDATA} (code=exited, status=0/SUCCESS)
Main PID: 9532 (code=exited, status=1/FAILURE)
Oct 02 15:26:43 km-data2.rs.fsbtech.com postgres[9534]: [9-1] user=,db=,app=client= LOG: entering standby mode
Oct 02 15:26:43 km-data2.rs.fsbtech.com postmaster[9532]: cp: cannot stat ‘barman_wal/00000002.history’: No such file or directory
Oct 02 15:26:43 km-data2.rs.fsbtech.com postmaster[9532]: cp: cannot stat ‘barman_wal/0000000200000C740000006D’: No such file or directory
Oct 02 15:26:43 km-data2.rs.fsbtech.com postgres[9532]: [7-1] user=,db=,app=client= LOG: startup process (PID 9534) was terminated by signal 6: Aborted
Oct 02 15:26:43 km-data2.rs.fsbtech.com postgres[9532]: [8-1] user=,db=,app=client= LOG: aborting startup due to startup process failure
Oct 02 15:26:43 km-data2.rs.fsbtech.com postgres[9532]: [9-1] user=,db=,app=client= LOG: database system is shut down
Oct 02 15:26:43 km-data2.rs.fsbtech.com systemd[1]: postgresql-12.service: main process exited, code=exited, status=1/FAILURE
Oct 02 15:26:43 km-data2.rs.fsbtech.com systemd[1]: Failed to start PostgreSQL 12 database server.
Oct 02 15:26:43 km-data2.rs.fsbtech.com systemd[1]: Unit postgresql-12.service entered failed state.
Oct 02 15:26:43 km-data2.rs.fsbtech.com systemd[1]: postgresql-12.service failed.
Running journalctl -xe returned:
-- The start-up result is done.
Oct 02 15:28:43 km-data2.rs.fsbtech.com sudo[10001]: pam_unix(sudo:session): session opened for user root by (uid=0)
Oct 02 15:28:43 km-data2.rs.fsbtech.com sudo[10001]: pam_unix(sudo:session): session closed for user root
Oct 02 15:28:43 km-data2.rs.fsbtech.com systemd[1]: Removed slice User Slice of root.
-- Subject: Unit user-0.slice has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit user-0.slice has finished shutting down.
Oct 02 15:28:43 km-data2.rs.fsbtech.com filebeat[71555]: {"log.level":"info","@timestamp":"2023-10-02T15:28:43.427+0100","log.logger":"monitoring","log.origin":{"file.name":"log/log.go","file.line":186},"message
Oct 02 15:28:44 km-data2.rs.fsbtech.com sudo[10016]: zabbix : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/etc/zabbix/scripts/postgres.sh connections sport
Oct 02 15:28:44 km-data2.rs.fsbtech.com systemd[1]: Created slice User Slice of root.
-- Subject: Unit user-0.slice has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit user-0.slice has finished starting up.
--
-- The start-up result is done.
Oct 02 15:28:44 km-data2.rs.fsbtech.com systemd[1]: Started Session c1058389 of user root.
-- Subject: Unit session-c1058389.scope has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit session-c1058389.scope has finished starting up.
--
-- The start-up result is done.
Oct 02 15:28:44 km-data2.rs.fsbtech.com sudo[10016]: pam_unix(sudo:session): session opened for user root by (uid=0)
Oct 02 15:28:44 km-data2.rs.fsbtech.com sudo[10016]: pam_unix(sudo:session): session closed for user root
Oct 02 15:28:44 km-data2.rs.fsbtech.com systemd[1]: Removed slice User Slice of root.
-- Subject: Unit user-0.slice has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit user-0.slice has finished shutting down.
The main error seems to be:
Oct 02 15:26:43 km-data2.rs.fsbtech.com postmaster[9532]: cp: cannot stat ‘barman_wal/00000002.history’: No such file or directory
Oct 02 15:26:43 km-data2.rs.fsbtech.com postmaster[9532]: cp: cannot stat ‘barman_wal/0000000200000C740000006D’: No such file or directory
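The cp messages suggest that the restore_command copies from a relative path, barman_wal/. As far as I understand, restore_command runs with the data directory as its working directory, so that path would be resolved under /var/lib/pgsql/12/data. The actual restore_command is not shown in this post, so this is an assumption. A minimal sketch to check whether the two requested files are present there follows; the function name check_wal_files is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report which of the named WAL files exist under
# <base>/barman_wal. Returns non-zero if any of them are missing.
check_wal_files() {
    base=$1
    shift
    rc=0
    for f in "$@"; do
        if [ -e "$base/barman_wal/$f" ]; then
            echo "found: $f"
        else
            echo "missing: $f"
            rc=1
        fi
    done
    return "$rc"
}

# Example (file names taken from the log above):
# check_wal_files /var/lib/pgsql/12/data 00000002.history 0000000200000C740000006D
```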
Any ideas, guys?
Doug Reed
dougreed765@xxxxxxxxx
07973-132664
https://uk.linkedin.com/pub/douglas-reed/33/326/2b