On 3/12/24 02:57, Nick Renders wrote:
On 11 Mar 2024, at 16:04, Adrian Klaver wrote:
On 3/11/24 03:11, Nick Renders wrote:
Thank you for your reply Laurenz.
I don't think it is related to any third party security software. We have several other machines with a similar setup, but this is the only server that has this issue.
The one thing different about this machine, however, is that it runs 2 instances of Postgres:
- cluster A on port 165
- cluster B on port 164
Cluster A is actually a backup of another Postgres server that is restored on a daily basis via Barman. This means that we log in remotely from the Barman server over SSH, stop cluster A's service (port 165), clear the Data folder, restore the latest backup into the Data folder, and start up the service again.
Cluster B's Data and service (port 164) remain untouched during all this time. This is the cluster that experiences the intermittent "operation not permitted" issue.
I suspended our restore script for the past 2 weeks, and the issue did not occur.
I have just performed another restore on cluster A and now cluster B is throwing errors in the log again.
Since it seems to be the trigger, what are the contents of the restore script?
Any idea why this is happening? It does not occur after every restore, but the two do seem to be related.
Thanks,
Nick Renders
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx
...how are A and B connected?
The 2 clusters are not connected. They run on the same macOS 14 machine with a single Postgres installation ( /Library/PostgreSQL/16/ ) and their respective Data folders are located on the same volume ( /Volumes/Postgres_Data/PostgreSQL/16/data and /Volumes/Postgres_Data/PostgreSQL/16-DML/data ). Besides that, they run independently on 2 different ports, specified in postgresql.conf.
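A quick way to confirm the separation, as a sketch against the paths above (the exact name of cluster B's include file is an assumption, hence the glob):

## each data directory has its own postmaster, started from the same binaries
/Library/PostgreSQL/16/bin/pg_ctl -D /Volumes/Postgres_Data/PostgreSQL/16/data status
/Library/PostgreSQL/16/bin/pg_ctl -D /Volumes/Postgres_Data/PostgreSQL/16-DML/data status
## and its own port setting in its own config files
grep -H '^port' /Volumes/Postgres_Data/PostgreSQL/16/data/ARC_postgresql_16.conf
grep -H '^port' /Volumes/Postgres_Data/PostgreSQL/16-DML/data/*.conf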
...run them under different users on the system.
Are you referring to the "postgres" user / role? Does that also mean setting up 2 postgres installation directories?
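For what it's worth, separate OS users would not require a second installation: the same /Library/PostgreSQL/16 binaries can run both clusters, as long as each data directory is owned by the user that starts that cluster (Postgres refuses to start otherwise). A rough sketch only, where the user name "postgres_b" is made up and the sysadminctl approach is illustrative:

## create a second daemon user for cluster B (name is hypothetical)
sudo sysadminctl -addUser postgres_b
## hand cluster B's data directory to that user; the server checks ownership and mode at startup
sudo chown -R postgres_b /Volumes/Postgres_Data/PostgreSQL/16-DML/data
sudo chmod 700 /Volumes/Postgres_Data/PostgreSQL/16-DML/data
## start cluster B as the new user, using the same binaries as cluster A
sudo -u postgres_b /Library/PostgreSQL/16/bin/pg_ctl -D /Volumes/Postgres_Data/PostgreSQL/16-DML/data start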
...what are the contents of the restore script?
## stop cluster A
ssh postgres@10.0.0.1 '/Library/PostgreSQL/16/bin/pg_ctl -D /Volumes/Postgres_Data/PostgreSQL/16/data stop'
## save config files (ARC_postgresql_16.conf is included in postgresql.conf and contains cluster-specific information like the port number)
ssh postgres@10.0.0.1 'cd /Volumes/Postgres_Data/PostgreSQL/16/data && cp ARC_postgresql_16.conf ../ARC_postgresql_16.conf'
ssh postgres@10.0.0.1 'cd /Volumes/Postgres_Data/PostgreSQL/16/data && cp pg_hba.conf ../pg_hba.conf'
## clear data directory
ssh postgres@10.0.0.1 'rm -r /Volumes/Postgres_Data/PostgreSQL/16/data/*'
## transfer recovery (this will copy the backup "20240312T040106" and any lingering WAL files into the Data folder)
barman recover --remote-ssh-command 'ssh postgres@10.0.0.1' pg 20240312T040106 /Volumes/Postgres_Data/PostgreSQL/16/data
## restore config files
ssh postgres@10.0.0.1 'mv /Volumes/Postgres_Data/PostgreSQL/16/ARC_postgresql_16.conf /Volumes/Postgres_Data/PostgreSQL/16/data/ARC_postgresql_16.conf'
ssh postgres@10.0.0.1 'mv /Volumes/Postgres_Data/PostgreSQL/16/pg_hba.conf /Volumes/Postgres_Data/PostgreSQL/16/data/pg_hba.conf'
## start cluster A
ssh postgres@10.0.0.1 '/Library/PostgreSQL/16/bin/pg_ctl -D /Volumes/Postgres_Data/PostgreSQL/16/data start > /dev/null'
This script runs on a daily basis at 4:30 AM. It did so this morning and there was no issue with cluster B. So even though the issue is most likely related to the script, the script does not trigger it every time.
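As an aside, a small guard would at least make a failed stop visible; a sketch only, relying on pg_ctl's nonzero exit status on failure:

## sketch: abort the whole script if cluster A does not stop cleanly, so the rm never runs under a live postmaster
ssh postgres@10.0.0.1 '/Library/PostgreSQL/16/bin/pg_ctl -D /Volumes/Postgres_Data/PostgreSQL/16/data stop -m fast' || exit 1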
I'm not seeing anything obvious, with the caveat that I'm on my first cup of coffee.
From your first post:
2024-02-26 10:29:41.580 CET [63962] FATAL: could not open file "global/pg_filenode.map": Operation not permitted
2024-02-26 10:30:11.147 CET [90610] LOG: could not open file "postmaster.pid": Operation not permitted; continuing anyway
For now, the only suggestion I have is to note the presence, ownership, and privileges of the above files in the present working setup. Then, when it fails, do the same and see if there is a difference. My hunch is that it is in this step:
barman recover --remote-ssh-command 'ssh postgres@10.0.0.1' pg 20240312T040106 /Volumes/Postgres_Data/PostgreSQL/16/data
If not in the step itself, then in the process that creates 20240312T040106.
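A sketch of that check, assuming cluster B's data directory is the 16-DML path from earlier in the thread:

## run on 10.0.0.1 while cluster B is healthy, and again when the errors appear, then compare the output
## on macOS, -e lists ACLs, -@ lists extended attributes, -O lists file flags; any of the three can
## produce "Operation not permitted" even when the plain mode bits and ownership look correct
ls -le@O /Volumes/Postgres_Data/PostgreSQL/16-DML/data/postmaster.pid
ls -le@O /Volumes/Postgres_Data/PostgreSQL/16-DML/data/global/pg_filenode.map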
Best regards,
Nick Renders
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx