We're seeing the FATAL error message from the subject line pop up in our logs at regular intervals, but I haven't been able to pinpoint what is causing it. I'm hoping for some insights here.
We run a PostgreSQL 11.9 server on CentOS 7, within a VMware environment:
PostgreSQL 11.9 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
The package was installed from the PGDG repository.
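(For reference, that version string is just the output of select version(); something like this reproduces it:)
# print the server's version string
psql -X -c "select version()"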
I'm not even sure I should be worried; there doesn't appear to be any impact on the server's functioning, but it does say 'FATAL'.
What we're seeing are lines like the following two instances:
2020-09-30 21:45:09.999 CEST [2375] 172.30.2.25 asdf STATEMENT: select count(*) from "util_asdf"."v_something_something2"
2020-09-30 21:45:15.018 CEST [2375] 172.30.2.25 asdf ERROR: canceling statement due to user request
2020-09-30 21:45:15.018 CEST [2375] 172.30.2.25 asdf STATEMENT: select count(*) from "dm_asdf"."asdf_func"
2020-09-30 21:45:20.027 CEST [2375] 172.30.2.25 asdf ERROR: canceling statement due to user request
2020-09-30 21:45:20.027 CEST [2375] 172.30.2.25 asdf STATEMENT: select count(*) from "dm_asdf"."asdf_func"
2020-09-30 22:27:51.422 CEST [20270] 172.30.2.26 selfservice_sales ERROR: canceling statement due to user request
2020-09-30 22:27:51.422 CEST [20270] 172.30.2.26 selfservice_sales STATEMENT: select count(*) from "dm_b2b"."prlwytzkofskiv1"
2020-09-30 22:27:51.422 CEST [30649] FATAL: terminating connection due to administrator command
2020-09-30 22:27:51.422 CEST [30649] STATEMENT: select count(*) from "dm_b2b"."prlwytzkofskiv1"
2020-09-30 22:27:51.422 CEST [30648] FATAL: terminating connection due to administrator command
2020-09-30 22:27:51.422 CEST [30648] STATEMENT: select count(*) from "dm_b2b"."prlwytzkofskiv1"
2020-09-30 22:27:51.422 CEST [30647] FATAL: terminating connection due to administrator command
2020-09-30 22:27:51.422 CEST [30647] STATEMENT: select count(*) from "dm_b2b"."prlwytzkofskiv1"
2020-09-30 22:27:51.422 CEST [30646] FATAL: terminating connection due to administrator command
2020-09-30 22:27:51.422 CEST [30646] STATEMENT: select count(*) from "dm_b2b"."prlwytzkofskiv1"
2020-09-30 22:27:51.422 CEST [30645] FATAL: terminating connection due to administrator command
2020-09-30 22:27:51.422 CEST [30645] STATEMENT: select count(*) from "dm_b2b"."prlwytzkofskiv1"
2020-09-30 22:27:51.435 CEST [6482] LOG: background worker "parallel worker" (PID 30646) exited with exit code 1
2020-09-30 22:27:51.435 CEST [6482] LOG: background worker "parallel worker" (PID 30645) exited with exit code 1
2020-09-30 22:27:51.435 CEST [6482] LOG: background worker "parallel worker" (PID 30647) exited with exit code 1
2020-09-30 22:27:51.435 CEST [6482] LOG: background worker "parallel worker" (PID 30648) exited with exit code 1
2020-09-30 22:27:51.435 CEST [6482] LOG: background worker "parallel worker" (PID 30649) exited with exit code 1
2020-09-30 22:27:56.446 CEST [20270] 172.30.2.26 selfservice_sales ERROR: canceling statement due to user request
2020-09-30 22:27:56.446 CEST [20270] 172.30.2.26 selfservice_sales STATEMENT: select count(*) from "dm_b2b"."prlwytzkofskiv1"
2020-09-30 22:27:56.446 CEST [30659] FATAL: terminating connection due to administrator command
2020-09-30 22:27:56.446 CEST [30659] STATEMENT: select count(*) from "dm_b2b"."prlwytzkofskiv1"
2020-09-30 22:27:56.446 CEST [30658] FATAL: terminating connection due to administrator command
2020-09-30 22:27:56.446 CEST [30658] STATEMENT: select count(*) from "dm_b2b"."prlwytzkofskiv1"
2020-09-30 22:27:56.446 CEST [30657] FATAL: terminating connection due to administrator command
2020-09-30 22:27:56.446 CEST [30657] STATEMENT: select count(*) from "dm_b2b"."prlwytzkofskiv1"
2020-09-30 22:27:56.446 CEST [30656] FATAL: terminating connection due to administrator command
2020-09-30 22:27:56.446 CEST [30656] STATEMENT: select count(*) from "dm_b2b"."prlwytzkofskiv1"
2020-09-30 22:27:56.446 CEST [30655] FATAL: terminating connection due to administrator command
2020-09-30 22:27:56.446 CEST [30655] STATEMENT: select count(*) from "dm_b2b"."prlwytzkofskiv1"
2020-09-30 22:27:56.459 CEST [6482] LOG: background worker "parallel worker" (PID 30655) exited with exit code 1
2020-09-30 22:27:56.459 CEST [6482] LOG: background worker "parallel worker" (PID 30656) exited with exit code 1
2020-09-30 22:27:56.459 CEST [6482] LOG: background worker "parallel worker" (PID 30657) exited with exit code 1
2020-09-30 22:27:56.459 CEST [6482] LOG: background worker "parallel worker" (PID 30658) exited with exit code 1
2020-09-30 22:27:56.459 CEST [6482] LOG: background worker "parallel worker" (PID 30659) exited with exit code 1
2020-09-30 22:43:08.459 CEST [8055] 172.30.2.25 selfservice_prd ERROR: schema "somethingelse" does not exist at character 71
Apparently, something is sending SIGTERM to our pg processes; the PIDs in those FATAL lines are the same ones the log then reports as exited parallel workers. I know that I'm not doing it, certainly not at those hours, and I'm the one who set up this system and its only DBA.
Advice I found on the Internet was to use SystemTap with a tap script, but the scripts I found only printed the PIDs of the processes involved, not their names, which wasn't much help in figuring out who was responsible. So I made an attempt (I have no experience with stap) at modifying one of them to also print the process names of the signal's sender and target:
/*
 * killsnoop-nd.stp   Trace process signals.
 *                    For Linux, uses SystemTap (non-debuginfo).
 *
 * Copyright (C) 2015 Brendan Gregg.
 * (etc)
 */
global target;   /* pid argument of kill(), keyed by calling thread */
global signal;   /* signal argument of kill(), keyed by calling thread */

probe begin
{
        printf("%-6s %-12s %-5s %-6s %-12s\n",
               "FROM", "COMMAND", "SIG", "TO", "COMMAND");
}

/* remember the arguments of each kill() call */
probe nd_syscall.kill
{
        target[tid()] = uint_arg(1);
        signal[tid()] = uint_arg(2);
}

/* on return, report SIGTERMs together with sender and target process names */
probe nd_syscall.kill.return
{
        if (signal[tid()] == 15 && target[tid()] != 0) {
                printf("%-6d %-12s %-5d %-6d %-12s\n",
                       pid(), execname(),
                       signal[tid()],
                       target[tid()], pid2execname(target[tid()]));
        }
        delete target[tid()];
        delete signal[tid()];
}
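(For completeness, this is roughly how I ran it; the file and output names are just the ones I happened to pick:)
# run as root; needs the systemtap and matching kernel-devel packages
# the script above is saved as killsnoop-nd.stp
stap -o /var/tmp/killsnoop.log killsnoop-nd.stp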
The output of last night was:
FROM COMMAND SIG TO COMMAND
30068 systemd-udevd 15 14151 systemd-udevd
30068 systemd-udevd 15 14836 systemd-udevd
(...)
6482 postmaster 15 30649 postmaster
6482 postmaster 15 30648 postmaster
6482 postmaster 15 30647 postmaster
6482 postmaster 15 30646 postmaster
6482 postmaster 15 30645 postmaster
6482 postmaster 15 30659 postmaster
6482 postmaster 15 30658 postmaster
6482 postmaster 15 30657 postmaster
6482 postmaster 15 30656 postmaster
6482 postmaster 15 30655 postmaster
6482 postmaster 15 2065 postmaster
6482 postmaster 15 2064 postmaster
6482 postmaster 15 2063 postmaster
Several of these TO PIDs match those in the PostgreSQL log.
$ ps aux | grep 6482
postgres 6482 0.0 0.5 51755776 1043872 ? Ss Sep28 3:25 /usr/pgsql-11/bin/postmaster -D /data/11/data
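(If it helps: the PIDs that received SIGTERM should all be children of that postmaster. Something like the command below would list them by name; the options are just what I'd reach for on this CentOS 7 box, and it's of course only meaningful while such workers are still alive.)
# list the postmaster's child processes (backends and background workers)
ps --ppid 6482 -o pid,args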
Am I correct in concluding that the postmaster is killing its own processes? If so, what is going on here? And, more importantly, what should I do about it?
Regards,
Alban Hertroys.
P.S. I'm mailing from my work account, as this is work related. You may remember me from my private account.