Hello Postgresql experts,
We are facing issues with our PostgreSQL databases running on an Ubuntu server, right after we migrated them from an openSUSE OS.
The new database server runs fine most of the time (average load 0.5 to 1.0) but spikes suddenly once or twice a day. This has happened four times in the last three days, and during these spikes simple UPDATE/SELECT statements start taking minutes (1 to 5 minutes) instead of 5-50 ms.
The spikes also max out the database's 250 connections. Each event halts all processing for about 15-20 minutes, after which everything returns to normal. I verified checkpoint- and vacuum-related activity, but it doesn't show any problem to me (logs attached).
top/vmstat output shows that all CPU was suddenly consumed in user time (%us) during the same window, while iostat doesn't show any I/O bottleneck. I have attached complete logs for yesterday's outage (13:45 to 14:15).
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
44 0 0 201929344 345260 50775772 0 0 2 15 2 2 2 0 98 0
40 0 0 201919264 345260 50775864 0 0 0 224 9409 1663 98 1 1 0
40 0 0 201915344 345260 50775880 0 0 0 280 8803 1674 99 0 0 0
38 0 0 201911296 345260 50775888 0 0 0 156 8753 1469 99 0 0 0
40 0 0 201902416 345260 50775888 0 0 0 224 9060 2775 98 1 1 0
free -m
total used free shared buffers cached
Mem: 251 59 192 0 0 48
-/+ buffers/cache: 10 241
Swap: 29 0 29
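If it is useful, the kind of checkpoint/vacuum check referred to above can be reproduced against the standard 9.0 statistics views, roughly as follows (a sketch only, not our exact queries):

-- Checkpoint activity: timed vs. requested checkpoints and buffers written.
SELECT checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean, buffers_backend
FROM pg_stat_bgwriter;

-- Vacuum/analyze recency and dead-tuple counts for the busiest tables.
SELECT relname, n_dead_tup,
       last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;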
System information:
Connections to our databases come from a web server (Apache with PHP) and script servers (PHP). We verified the Apache logs and did not find any unusual connection traffic during the same interval.
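During the next spike we can also capture what each backend is doing from pg_stat_activity (a sketch for 9.0, where the columns are procpid/current_query rather than pid/query):

-- Non-idle backends, how long each query has been running, and whether it is waiting on a lock.
SELECT procpid, usename, waiting,
       now() - query_start AS runtime,
       current_query
FROM pg_stat_activity
WHERE current_query <> '<IDLE>'
ORDER BY query_start;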
Hardware information:
DELL PowerEdge R715
Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
Ubuntu 12.04.4 LTS
kernel: 3.8.0-35-generic 64 bit
PostgreSQL version: 9.0.13
RAM: 256 GB, 32 CPU cores
pg_xlog: RAID 1
data directory: RAID 10 (6 stripes)
read:write ratio: 85:15
PgBouncer configured on the database side (250 allowed connections)
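For what it's worth, PgBouncer's pool state during a spike can also be inspected from its admin console (connecting to the special "pgbouncer" database), for example:

SHOW POOLS;   -- cl_active / cl_waiting / sv_active / sv_idle per pool
SHOW STATS;   -- request and traffic totals per database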
PostgreSQL configuration:
Database size: 28 GB
Vacuum/analyze run daily
checkpoint_completion_target = 0.9
maintenance_work_mem = 16GB
shared_buffers = 8GB # we reduced this from 32 GB.
max_connections = 300
checkpoint_segments = 32
checkpoint_timeout = 8min
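The effective values of these settings can be cross-checked from pg_settings if needed, e.g. (a sketch):

SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('shared_buffers', 'maintenance_work_mem', 'max_connections',
               'checkpoint_segments', 'checkpoint_timeout', 'checkpoint_completion_target');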
Detailed PostgreSQL configuration: http://pastie.org/8754957
Checkpoint/vacuum information: http://pastie.org/8754954
top command output: http://pastie.org/8755007
iostat output: http://pastie.org/8755009
sysctl configuration: http://pastie.org/8755197
We recently upgraded the OS kernel to try to fix this issue, but it didn't help. We also tried modifying some OS parameters based on discussions such as:
http://www.postgresql.org/message-id/50E4AAB1.9040902@xxxxxxxxxxxxxxxx
vm.dirty_background_bytes = 33554432 # I reduced this based on some forums.
vm.dirty_bytes = 536870912
vm.overcommit_memory=2
kernel.sched_migration_cost = 5000000
kernel.sched_autogroup_enabled = 0
We believe our PostgreSQL configuration is not sized correctly for the memory available on this machine and needs urgent tuning.
Could you please guide us in troubleshooting this issue?
Thanks in advance.