Hello all,
I posted this same problem to the performance list. Some people replied, but the issue is still unresolved.
For the past two weeks I have been stuck with a very strange problem: from time to time (sometimes at intervals of less than 10 minutes), the server hangs (I don't know what else to call it) and every PostgreSQL connection, no matter whether it is running SELECT, UPDATE, DELETE, INSERT, startup or authentication, appears to be paused. After some seconds (roughly 10 to 15, sometimes less) everything goes back to normal.
So my first step was to check the disks. Running "iostat" showed nothing apparently wrong with them. The storage is RAID 10 over four 600 GB SAS disks on an IBM DS3512, attached over FC. IBM DS Storage Manager reports the disks as OK.
Then memory. Swap is barely being used:
[###@### data]# free -m
             total       used       free     shared    buffers     cached
Mem:        145182     130977      14204          0         43     121407
-/+ buffers/cache:       9526     135655
Swap:         6143         65       6078
No errors in /var/log/messages.
Here is what I have tried so far:
1) Emre Hasegeli suggested reducing shared_buffers, but it is already low:
total server memory: 141 GB
shared_buffers: 16 GB
Maybe it is too low? I have been thinking of increasing it to 32 GB.
max_connections = 500, with ~400 connections on average
2) Since the processes appear to hang on "semop", I tried the following, as suggested on a tuning page I found on the web. Is this right?
echo "250 32000 200 128" > /proc/sys/kernel/sem
3) I think my problem could be related to LWLocks, as I did some googling and found reports and slides describing similar problems. Is there some way I can confirm this?
4) Rebooting the server didn't make any difference.
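For reference on item 2: the four numbers written to /proc/sys/kernel/sem are, in order, SEMMSL (max semaphores per array), SEMMNS (max semaphores system wide), SEMOPM (max ops per semop call) and SEMMNI (max number of arrays). A minimal Python sketch that labels them follows; the helper names parse_sem and read_sem are made up for illustration.

```python
from typing import Dict

# Field order of /proc/sys/kernel/sem on Linux.
SEM_FIELDS = ("SEMMSL", "SEMMNS", "SEMOPM", "SEMMNI")


def parse_sem(text: str) -> Dict[str, int]:
    """Label the four whitespace-separated values of kernel.sem."""
    values = [int(v) for v in text.split()]
    if len(values) != len(SEM_FIELDS):
        raise ValueError("expected 4 values, got %d" % len(values))
    return dict(zip(SEM_FIELDS, values))


def read_sem(path: str = "/proc/sys/kernel/sem") -> Dict[str, int]:
    """Read the live setting from a Linux host (hypothetical helper)."""
    with open(path) as fh:
        return parse_sem(fh.read())


if __name__ == "__main__":
    # The string echoed in item 2.
    for name, value in parse_sem("250 32000 200 128").items():
        print(name, value)
```

Read this way, the echoed values line up with the "Semaphore Limits" section of the ipcs -l output further down.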
Below is the strace output of one process and of some others, plus other information that may be useful. Every process I have straced shows the same pattern: it seems to get stuck on semop.
Any help is appreciated,
[###@### ~]# strace -ttp 5209
Process 5209 attached - interrupt to quit
09:01:54.122445 semop(2293765, {{15, -1, 0}}, 1) = 0
09:01:55.368785 semop(2293765, {{15, -1, 0}}, 1) = 0
09:01:55.368902 semop(2523148, {{11, 1, 0}}, 1) = 0
09:01:55.368978 semop(2293765, {{15, -1, 0}}, 1) = 0
09:01:55.369861 semop(2293765, {{15, -1, 0}}, 1) = 0
09:01:55.370648 semop(3047452, {{6, 1, 0}}, 1) = 0
09:01:55.370694 semop(2293765, {{15, -1, 0}}, 1) = 0
09:01:55.370762 semop(2785300, {{12, 1, 0}}, 1) = 0
09:01:55.370805 access("base/2048098929", F_OK) = 0
09:01:55.370953 open("base/2048098929/PG_VERSION", O_RDONLY) = 5
[###@### ~]# strace -p 16877 -tt
Process 16877 attached - interrupt to quit
09:57:56.305123 semop(163844, {{13, -1, 0}}, 1) = 0
09:57:59.453714 semop(163844, {{13, -1, 0}}, 1) = 0
09:58:04.004023 semop(163844, {{13, -1, 0}}, 1) = 0
09:58:04.004209 brk(0x1f44000) = 0x1f44000
09:58:04.004305 brk(0x1f42000) = 0x1f42000
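The stalls in traces like the two above can be picked out mechanically. This is only a rough Python sketch, assuming "strace -tt" output with microsecond timestamps as shown; find_stalls and the 1 second threshold are illustrative choices, not an established tool. Since -tt prints the time each call starts, a gap is attributed to the earlier call, and it cannot tell time blocked inside that call from time spent in user space after it (strace -T would report the time spent in each call).

```python
import re
from datetime import datetime

# Matches lines such as: 09:57:56.305123 semop(163844, {{13, -1, 0}}, 1) = 0
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6})\s+(\w+)\(")


def find_stalls(lines, threshold=1.0):
    """Return (start_time, syscall, gap_seconds) for each call that is
    followed by a gap of at least `threshold` seconds before the next call."""
    stalls = []
    prev = None  # (datetime, raw timestamp, syscall name) of the previous call
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip "Process ... attached" and similar lines
        ts = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        if prev is not None:
            gap = (ts - prev[0]).total_seconds()
            if gap < 0:
                gap += 86400  # trace crossed midnight
            if gap >= threshold:
                stalls.append((prev[1], prev[2], round(gap, 6)))
        prev = (ts, m.group(1), m.group(2))
    return stalls


if __name__ == "__main__":
    import sys
    # Usage: strace -tt -p <pid> 2>&1 | python this_script.py
    for start, name, gap in find_stalls(sys.stdin):
        print(start, name, gap)
```

Run over the second trace, this flags the first two semop calls, which are followed by gaps of about 3.1 and 4.6 seconds.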
[###@### data]# ipcs -l
------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 83886080
max total shared memory (kbytes) = 17179869184
min seg size (bytes) = 1
------ Semaphore Limits --------
max number of arrays = 128
max semaphores per array = 250
max semaphores system wide = 32000
max ops per semop call = 200
semaphore max value = 32767
------ Messages: Limits --------
max queues system wide = 32768
max size of message (bytes) = 65536
default max size of queue (bytes) = 65536
[###@### data]# ipcs -u
----- Semaphore Status -------
used arrays: 34
allocated semaphores: 546
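As a sanity check on these numbers, here is a small Python sketch of the semaphore sizing rule as I understand it from the PostgreSQL "Managing Kernel Resources" documentation: one set of 17 semaphores for every 16 allowed processes. The "+ 4" term and autovacuum_max_workers = 3 (the default) are assumptions, since neither is stated in this post, and the function name is made up.

```python
import math


def postgres_semaphore_needs(max_connections, autovacuum_max_workers=3,
                             extra_procs=4):
    """Estimate (semaphore sets, total semaphores) PostgreSQL allocates.

    Assumed rule: ceil((max_connections + autovacuum_max_workers
    + extra_procs) / 16) sets, each holding 17 semaphores.
    """
    procs = max_connections + autovacuum_max_workers + extra_procs
    sets = math.ceil(procs / 16)
    return sets, sets * 17


if __name__ == "__main__":
    sets, total = postgres_semaphore_needs(500)  # max_connections = 500
    print(sets, total)
```

Under those assumptions, max_connections = 500 works out to 32 sets and 544 semaphores, which is close to the 34 arrays and 546 semaphores reported above and far below the limits of 128 arrays and 32000 semaphores, so the limits themselves do not look exhausted.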
[###@### data]# uname -a
Linux ### 2.6.32-279.14.1.el6.x86_64 #1 SMP Tue Nov 6 23:43:09 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
postgres=# select version();
version
--------------------------------------------------------------------------------------------------------------
PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit
(1 row)
[###@### data]# cat /etc/redhat-release
CentOS release 6.3 (Final)