After terminating session file has been updated with new lines.
***content from the previous letter***
***That one unfinished line. Freeze for ~6 hours.***
epoll_wait(4, 0x555d738e4f30, 1, -1) = -1 EINTR (Interrupted system call) <21328.371590>
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=2340453, si_uid=2001} ---
kill(2292604, SIGURG) = 0 <0.000056>
rt_sigreturn({mask=[URG]}) = -1 EINTR (Interrupted system call) <0.000027>
write(2, "\0\0;\1|\373\"\0t2024-03-27 18:43:56 MSK"..., 324) = 324 <0.000027>
rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0 <0.000018>
close(64) = 0 <0.000041>
close(218) = 0 <0.000029>
close(220) = 0 <0.000027>
close(219) = 0 <0.000023>
unlink("base/121442067/t4_182045966") = 0 <0.000520>
unlink("base/121442067/t4_182045966.1") = -1 ENOENT (No such file or directory) <0.000087>
unlink("base/121442067/t4_182045966_fsm") = -1 ENOENT (No such file or directory) <0.000060>
unlink("base/121442067/t4_182045966_vm") = -1 ENOENT (No such file or directory) <0.000053>
unlink("base/121442067/t4_182045966_init") = -1 ENOENT (No such file or directory) <0.000061>
unlink("base/121442067/t4_182045963") = 0 <0.000454>
***a lot of unlink***
unlink("base/121442067/t4_182045892") = 0 <0.000046>
unlink("base/121442067/t4_182045892.1") = -1 ENOENT (No such file or directory) <0.000027>
unlink("base/121442067/t4_182045892_fsm") = -1 ENOENT (No such file or directory) <0.000023>
unlink("base/121442067/t4_182045892_vm") = -1 ENOENT (No such file or directory) <0.000021>
unlink("base/121442067/t4_182045892_init") = -1 ENOENT (No such file or directory) <0.000027>
close(407) = 0 <0.000032>
stat("base/pgsql_tmp/pgsql_tmp2292604.0", {st_mode=S_IFREG|0600, st_size=19141117, ...}) = 0 <0.000024>
unlink("base/pgsql_tmp/pgsql_tmp2292604.0") = 0 <0.010861>
sendto(9, "\24\0\0\0\30\0\0\0\23\17=\7]U\0\0\375\21$\1\0\0\0\0", 24, 0, NULL, 0) = 24 <0.000067>
write(2, "\0\0\265\0|\373\"\0t2024-03-27 18:43:56 MSK"..., 190) = 190 <0.000030>
brk(0x555d798fa000) = 0x555d798fa000 <0.002482>
brk(0x555d78902000) = 0x555d78902000 <0.003686>
brk(0x555d76f75000) = 0x555d76f75000 <0.008538>
openat(AT_FDCWD, "global/pg_filenode.map", O_RDONLY) = 64 <0.000043>
read(64, "\27'Y\0+\0\0\0\356\4\0\0r\316Y\10\224\v\0\0y\316Y\10\275\4\0\0\177\316Y\10"..., 512) = 512 <0.000022>
close(64) = 0 <0.000019>
openat(AT_FDCWD, "base/121442067/pg_filenode.map", O_RDONLY) = 64 <0.000027>
read(64, "\27'Y\0\21\0\0\0\353\4\0\0\3\26?\10\341\4\0\0\334\32?\10\347\4\0\0\347\4\0\0"..., 512) = 512 <0.000018>
close(64) = 0 <0.000018>
close(20) = 0 <0.000019>
***a lot of close***
close(138) = 0 <0.000019>
openat(AT_FDCWD, "base/121442067/138352131", O_RDWR) = 5 <0.000052>
lseek(5, 0, SEEK_END) = 149766144 <0.000018>
openat(AT_FDCWD, "base/121442067/2601", O_RDWR) = 6 <0.000029>
lseek(6, 0, SEEK_END) = 8192 <0.000017>
kill(21, SIGURG) = 0 <0.000035>
sendto(9, "\31\0\0\0\20\0\0\0\23\17=\7\4\0\0\0", 16, 0, NULL, 0) = 16 <0.000078>
sendto(9, "\2\0\0\0\300\3\0\0\23\17=\7\10\0\0\0\3\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 960, 0, NULL, 0) = 960 <0.000036>
sendto(9, "\2\0\0\0\300\3\0\0\23\17=\7\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 960, 0, NULL, 0) = 960 <0.000041>
***a lot of sendto***
sendto(9, "\20\0\0\0H\0\0\0\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\r\0\0\0\0\0\0\0"..., 72, 0, NULL, 0) = 72 <0.000030>
write(2, "\0\0\322\0|\373\"\0t2024-03-27 18:43:56 MSK"..., 219) = 219 <0.000050>
futex(0x7fd5aacbf6c4, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000020>
exit_group(1) = ?
+++ exited with 1 +++
PS: I guess "No such file or directory" messages due to execution of strace on host, and postgres process is executing inside docker.
Среда, 27 марта 2024, 18:39 +03:00 от Ц <pfunk@xxxxxxx>:
I was able to connect to this process usingstrace -T -p `pgrep -n -u ks-postgres -f "ocb.*FETCH"` -o strace_of_active_session -ff(the -T flag gives us the time the system call has been running at the end of each line).If the session had been hanging active for some time, there was only one unfinished line in strace:epoll_wait(4,But if we got it a bit earlier, we got this output:***something before process was catched***epoll_wait(4, [{EPOLLOUT, {u32=1938706152, u64=93859859025640}}], 1, -1) = 1 <0.003515>sendto(10, "1.211 \320\224\320\232Z200.21121002 \320\240\320\232211)"..., 1968, 0, NULL, 0) = 1968 <0.000068>sendto(10, " \320\277\320\276\320\273 \321\204\320\265\320\262\321\200\320\260\320\273\321\214 \320\235\320\224\320\241 \320\275"..., 8192, 0, NULL, 0) = 8192 <0.000051>pread64(407, "y\221\275\0\0\0\0\0\0\342\275[\0\325\320\244\320\232\320\243 \320\220\320\264\320\274\320\270\320\275\320"..., 8192, 10354688) = 8192 <0.000065>sendto(10, "\377\377\0\0\0\270(907.0702.1340380180.111.2"..., 8192, 0, NULL, 0) = 8192 <0.000066>pread64(407, "00057476 \320\276\321\202 31.01.2024 \320\262 \321\202.\321"..., 8192, 10362880) = 8192 <0.000044>sendto(10, "7233430 \320\225\320\264\320\270\320\275\321\213\320\271 \320\275\320\260\320\273\320\276\320\263\320"..., 8192, 0, NULL, 0) = 8192 <0.000073>***a lot of pread64, sendto***sendto(10, "010006300\377\377\377\377\0\0\0X\320\241\320\274\320\276\320\273\320\265\320\275\321\201\320"..., 8192, 0, NULL, 0) = 8192 <0.000023>pread64(407, "901001\10\2\0\0\30\0\3\0 y\221\235\0\0\0\0\0\0\34\202[\0\20\2\0\0"..., 8192, 16949248) = 8192 <0.000020>sendto(10, "\321\200\320\265\320\265\321\201\321\202\321\200\321\203 \342\204\226 24119 \320\276\321\202 19"..., 8192, 0, NULL, 0) = 8192 <0.000024>sendto(10, ") \320\273/\321\20120907234120 \320\221\320\23690741220"..., 8192, 0, NULL, 0) = 8192 <0.000022>pread64(407, "\320\275\321\201\320\272\320\260 (\320\243\320\277\321\200\320\260\320\262\320\273\320\265\320\275\320\270\320\265 \320"..., 8192, 16957440) = 8192 <0.000019>sendto(10, "632\0\0\0\n7707083893\0\0\0\0\0\0\0\t6730020"..., 8192, 0, NULL, 0) = 8192 <0.000022>pread64(407, "\1\0\0\30\0\3\0 y\221\275\0\0\0\0\0\0\356\367[\0\375\320\244\320\232\320\243 \320\220\320"..., 8192, 16965632) = 8192 <0.000020>sendto(10, "\0\0\0\02403234643667010006300\377\377\377\377\0\0\0d"..., 8192, 0, NULL, 0) = 6632 <0.000026>sendto(10, "\320\260\321\206\320\270\320\270 \320\263\320\276\321\200\320\276\320\264\320\260 \320\241\320\274\320\276\320\273\320\265"..., 1560, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000019>epoll_wait(4, 0x555d738e4f30, 1, -1) = -1 EINTR (Interrupted system call) <2.855325>--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=2292607, si_uid=2001} ---kill(2292604, SIGURG) = 0 <0.000023>rt_sigreturn({mask=[URG]}) = -1 EINTR (Interrupted system call) <0.000052>sendto(10, "\320\260\321\206\320\270\320\270 \320\263\320\276\321\200\320\276\320\264\320\260 \320\241\320\274\320\276\320\273\320\265"..., 1560, 0, NULL, 0) = 1560 <0.000033>pread64(407, "\320\265\321\202.\23\1\202\4\0J\6\210\23\025044525411\0276732208"..., 8192, 16973824) = 8192 <0.000026>sendto(10, "2.24,4085 \320\276\321\202 21.02.24 \320\272-\321\202 01"..., 8192, 0, NULL, 0) = 8192 <0.000025>pread64(407, "\276\320\263. 59003491 \320\276\321\202 16.11.2007 \320\235"..., 8192, 16982016) = 8192 <0.000020>***a lot of pread64, sendto***pread64(407, "\260 (\320\234\320\221\320\236\320\243 \"\320\241\320\250 \342\204\226 11\")+40702"..., 8192, 17375232) = 8192 <0.000021>sendto(10, "\274\320\276\320\273\320\265\320\275\321\201\320\272\320\276\320\265 \320\276\321\202\320\264\320\265\320\273\320\265\320\275"..., 8192, 0, NULL, 0) = 4000 <0.000025>sendto(10, "\n6731030957\0\0\0\0\0\0\0\t673001001\377\377\377\377"..., 4192, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000019>epoll_wait(4, [{EPOLLIN, {u32=1938706176, u64=93859859025664}}], 1, -1) = 1 <0.000018>read(11, "\27\0\0\0\0\0\0\0\0\0\0\0|\373\"\0\321\7\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128 <0.000020>epoll_wait(4, 0x555d738e4f30, 1, -1) = -1 EINTR (Interrupted system call) <4.214523>--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---kill(2292604, SIGURG) = 0 <0.000028>setitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=1790, tv_usec=905739}}, NULL) = 0 <0.000020>rt_sigreturn({mask=[URG]}) = -1 EINTR (Interrupted system call) <0.000018>sendto(10, "\n6731030957\0\0\0\0\0\0\0\t673001001\377\377\377\377"..., 4192, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000023>epoll_wait(4, [{EPOLLIN, {u32=1938706176, u64=93859859025664}}], 1, -1) = 1 <0.000017>read(11, "\27\0\0\0\0\0\0\0\0\0\0\0|\373\"\0\321\7\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128 <0.000021>***First hang in strace output for <1790.905479> seconds******It was looking like last line in this file******epoll_wait(4, ***epoll_wait(4, 0x555d738e4f30, 1, -1) = -1 EINTR (Interrupted system call) <1790.905479>--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---kill(2292604, SIGURG) = 0 <0.000035>kill(-2292604, SIGINT) = 0 <0.000020>--- SIGINT {si_signo=SIGINT, si_code=SI_USER, si_pid=2292604, si_uid=2001} ---rt_sigreturn({mask=[ALRM URG]}) = 0 <0.000018>kill(2292604, SIGINT) = 0 <0.000017>--- SIGINT {si_signo=SIGINT, si_code=SI_USER, si_pid=2292604, si_uid=2001} ---rt_sigreturn({mask=[ALRM URG]}) = 0 <0.000016>rt_sigreturn({mask=[URG]}) = -1 EINTR (Interrupted system call) <0.000016>sendto(10, "\n6731030957\0\0\0\0\0\0\0\t673001001\377\377\377\377"..., 4192, 0, NULL, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000025>epoll_wait(4, [{EPOLLIN, {u32=1938706176, u64=93859859025664}}], 1, -1) = 1 <0.000020>read(11, "\27\0\0\0\0\0\0\0\0\0\0\0|\373\"\0\321\7\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128 <0.000018>epoll_wait(4,***last line looks incomplete and it hangs forever***When I first noticed that the file was not growing, the last line was the same incomplete oneepoll_wait(4 ,and I thought that nothing would change.But later I noticed that the file had changed.The line was formed in:epoll_wait(4, 0x555d738e4f30, 1, -1) = -1 EINTR (Interrupted system call) <1790.905479>telling us that epoll_wait had been hanging for 30 minutes.An there were added 13 more lines with the same ending epoll_wait(4,...but it was already final.Here are the values of the possibly involved parameters:name | setting | unit-------------------------------------+---------+------deadlock_timeout | 1000 | msidle_in_transaction_session_timeout | 30000 | msidle_session_timeout | 60000 | msstatement_timeout | 1800000 | mstcp_keepalives_count | 4 |tcp_keepalives_idle | 30 | stcp_keepalives_interval | 5 | stcp_user_timeout | 0 | msPS: found an issue in node-postgres with similar symptoms but can't figure out how their solution can be applied in my case.Вторник, 26 марта 2024, 17:19 +03:00 от Ц <pfunk@xxxxxxx>:
Greetings!I’ve faced with strange behavior when I see a lot of active sessions started hours ago while statement_timeout = '30min'.All of them are fetching from cursors.Typical session looks like:backend_start | 2024-03-26 14:34:20.552594+03xact_start | 2024-03-26 14:34:54.974628+03query_start | 2024-03-26 14:35:02.024133+03state_change | 2024-03-26 14:35:02.024134+03wait_event_type | Clientwait_event | ClientWritestate | activebackend_xid | 23240392backend_xmin | 23226474query | fetch all from "<unnamed portal 20>"backend_type | client backendThey are accumulating up to tens by the end of the day with all negative impacts on performance.Initially I thought that clients already died but due to network issues database considers them to be alive. So I set tcp_keepalive GUCs to nonzero values. Without success.Then I checked connections from the app server side and found them in ESTABLISHED state.It's certainly an application fault and it should not hold cursor forever...butIs the any GUC parameters to fight with such «clients»?