Search Postgresql Archives

Re: Postgres PANIC when it could not open file in pg_logical/snapshots directory

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On Tue, 22 Jun 2021 at 14:34, Vijaykumar Jain <vijaykumarjain.github@xxxxxxxxx> wrote:

On Tue, 22 Jun 2021 at 13:32, Mike Yeap <wkk1020@xxxxxxxxx> wrote:
Hi all,

I have a Postgres version 11.11 configured with both physical replication slots (for repmgr) as well as some logical replication slots (for AWS Database Migration Service (DMS)). This morning, the server went panic with the following messages found in the log file:

2021-06-22 04:56:35.314 +08 [PID=19457 application="[unknown]" user_name=dms database=** host(port)=**(48360)] PANIC:  could not open file "pg_logical/snapshots/969-FD606138.snap": Operation not permitted

2021-06-22 04:56:35.317 +08 [PID=1752 application="" user_name= database= host(port)=] LOG:  server process (PID 19457) was terminated by signal 6: Aborted

2021-06-22 04:56:35.317 +08 [PID=1752 application="" user_name= database= host(port)=] LOG:  terminating any other active server processes


I just tried the below case, when a running logical replication is denied access to the snapshot folder via chattr +i  to make the pg crash at publisher.
Hence I was speculating, something was occured at the filesystem, this may not be related, but just trying to get a scenario to simulate a repeatable crash.

A (port 5001 publisher) -> logical replication all tables -> B(port 5002 subscriber), all working fine.

postgres@db:~/playground/logical_replication$ psql -p 5001 -c 'select count(1) from t;'
 count
-------
  1000
(1 row)

postgres@db:~/playground/logical_replication$ psql -p 5002 -c 'select count(1) from t;'
 count
-------
  1000
(1 row)

now i change file attributes of pg_logical folder  at the publisher

root@db:/home/postgres/playground/logical_replication/db1/pg_logical# cd /home/postgres/playground/logical_replication/db1/pg_logical
root@db:/home/postgres/playground/logical_replication/db1/pg_logical# lsattr
--------------e----- ./mappings
--------------e----- ./snapshots
--------------e----- ./replorigin_checkpoint
root@db:/home/postgres/playground/logical_replication/db1/pg_logical# chattr -R +i *   # do not allow mod
root@db:/home/postgres/playground/logical_replication/db1/pg_logical# lsattr
----i---------e----- ./mappings
----i---------e----- ./snapshots
----i---------e----- ./replorigin_checkpoint

 psql -p 5001 -c 'delete from t; checkpoint;'  # and crash

2021-06-24 00:22:36.998 IST [2899] LOG:  could not remove file "pg_logical/snapshots/0-16CE9D8.snap": Operation not permitted
2021-06-24 00:22:36.998 IST [2899] LOG:  could not remove file "pg_logical/snapshots/0-16E72B8.snap": Operation not permitted
2021-06-24 00:22:36.998 IST [2899] LOG:  could not remove file "pg_logical/snapshots/0-16CE9A0.snap": Operation not permitted
2021-06-24 00:22:36.998 IST [2899] LOG:  could not remove file "pg_logical/snapshots/0-16E7398.snap": Operation not permitted
2021-06-24 00:22:37.003 IST [2899] PANIC:  could not open file "pg_logical/replorigin_checkpoint": Operation not permitted
2021-06-24 00:22:37.092 IST [2897] LOG:  checkpointer process (PID 2899) was terminated by signal 6: Aborted
2021-06-24 00:22:37.092 IST [2897] LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost
2021-06-24 00:22:37.093 IST [2897] LOG:  all server processes terminated; reinitializing
postgres@db:~/playground/logical_replication$ 2021-06-24 00:22:37.100 IST [2920] LOG:  database system was interrupted; last known up at 2021-06-24 00:20:54 IST
2021-06-24 00:22:37.100 IST [2921] FATAL:  the database system is in recovery mode
2021-06-24 00:22:37.210 IST [2920] LOG:  could not open file "./pg_logical/snapshots/0-16CE9D8.snap": Operation not permitted
2021-06-24 00:22:37.210 IST [2920] LOG:  could not open file "./pg_logical/snapshots/0-16E72B8.snap": Operation not permitted
2021-06-24 00:22:37.210 IST [2920] LOG:  could not open file "./pg_logical/snapshots/0-16CE9A0.snap": Operation not permitted
2021-06-24 00:22:37.210 IST [2920] LOG:  could not open file "./pg_logical/snapshots/0-16EF1E0.snap": Operation not permitted
2021-06-24 00:22:37.210 IST [2920] LOG:  could not open file "./pg_logical/snapshots/0-16E7398.snap": Operation not permitted
2021-06-24 00:22:37.210 IST [2920] LOG:  could not open file "./pg_logical/replorigin_checkpoint": Operation not permitted
2021-06-24 00:22:37.212 IST [2920] LOG:  database system was not properly shut down; automatic recovery in progress
2021-06-24 00:22:37.214 IST [2920] LOG:  redo starts at 0/16E72B8
2021-06-24 00:22:37.214 IST [2920] LOG:  invalid record length at 0/16EF218: wanted 24, got 0
2021-06-24 00:22:37.214 IST [2920] LOG:  redo done at 0/16EF1E0 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2021-06-24 00:22:37.217 IST [2920] LOG:  could not remove file "pg_logical/snapshots/0-16CE9D8.snap": Operation not permitted
2021-06-24 00:22:37.217 IST [2920] LOG:  could not remove file "pg_logical/snapshots/0-16E72B8.snap": Operation not permitted
2021-06-24 00:22:37.217 IST [2920] LOG:  could not remove file "pg_logical/snapshots/0-16CE9A0.snap": Operation not permitted
2021-06-24 00:22:37.217 IST [2920] LOG:  could not remove file "pg_logical/snapshots/0-16E7398.snap": Operation not permitted
2021-06-24 00:22:37.219 IST [2920] PANIC:  could not open file "pg_logical/replorigin_checkpoint": Operation not permitted
2021-06-24 00:22:37.309 IST [2897] LOG:  startup process (PID 2920) was terminated by signal 6: Aborted
2021-06-24 00:22:37.309 IST [2897] LOG:  aborting startup due to startup process failure
2021-06-24 00:22:37.310 IST [2897] LOG:  database system is shut down




*************************
now i restore the attr back for pg_logical and restart the db

root@db:/home/postgres/playground/logical_replication/db1/pg_logical# chattr -R -i *
root@db:/home/postgres/playground/logical_replication/db1/pg_logical# lsattr
--------------e----- ./mappings
--------------e----- ./replorigin_checkpoint.tmp
--------------e----- ./snapshots
--------------e----- ./replorigin_checkpoint


2021-06-24 00:24:00.026 IST [2947] LOG:  database system is ready to accept connections
 done
server started
postgres@db:~/playground/logical_replication$ 2021-06-24 00:24:02.357 IST [2957] LOG:  starting logical decoding for slot "mysub"
2021-06-24 00:24:02.357 IST [2957] DETAIL:  Streaming transactions committing after 0/16EF1E0, reading WAL from 0/16EF1E0.
2021-06-24 00:24:02.357 IST [2957] STATEMENT:  START_REPLICATION SLOT "mysub" LOGICAL 0/16E7398 (proto_version '2', publication_names '"mypub"')
2021-06-24 00:24:02.357 IST [2957] LOG:  logical decoding found consistent point at 0/16EF1E0
2021-06-24 00:24:02.357 IST [2957] DETAIL:  There are no running transactions.
2021-06-24 00:24:02.357 IST [2957] STATEMENT:  START_REPLICATION SLOT "mysub" LOGICAL 0/16E7398 (proto_version '2', publication_names '"mypub"')
 
I might be wrong, but if you can put up monitoring around your underlying storage, that may help.
I do not know how it works on the cloud, but I used to do it over the hypervisor layer on vmware and emit metrics for the same to relate errors with crash.

--
Thanks,
Vijay
Mumbai, India

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]

  Powered by Linux