On 21.2.2014 13:22, Ashutosh Durugkar wrote:
> Hello PostgreSQL experts,
>
> We are facing issues with our PostgreSQL databases running on an
> Ubuntu server, right after we moved our databases from an OpenSuse
> O/S.
>
> It's a new database server and it runs fine most of the time (avg.
> load 0.5 to 1.0), but suddenly spikes once or twice a day. This has
> happened four times in the last three days, and during the spikes
> simple update/select statements start taking minutes (1 to 5 minutes)
> instead of 5-50 msec.
>
> This maxes out the 250 allowed database connections. The event halts
> all processing for about 15-20 min, and then everything goes back to
> normal. I verified checkpoint and vacuum related activities but they
> don't show any problem to me. (attached logs)

That is a pretty high number of connections, considering the number of
CPUs / spindles. That may easily turn into a big issue, considering the
configuration (see below).

> Top/vmstat output shows all resources were suddenly utilized by %us
> during the same time. iostat doesn't show any IO-related bottleneck.
> I have added complete logs for yesterday's outage (13:45 to 14:15).
>
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b swpd      free   buff    cache    si so bi  bo   in   cs us sy id wa
> 44  0    0 201929344 345260 50775772    0  0  2  15    2    2  2  0 98  0
> 40  0    0 201919264 345260 50775864    0  0  0 224 9409 1663 98  1  1  0
> 40  0    0 201915344 345260 50775880    0  0  0 280 8803 1674 99  0  0  0
> 38  0    0 201911296 345260 50775888    0  0  0 156 8753 1469 99  0  0  0
> 40  0    0 201902416 345260 50775888    0  0  0 224 9060 2775 98  1  1  0

What about top? Who's eating the CPU - backend processes or something
else? Try to run "perf top" to see what functions are at the top (I
suspect it might be related to spinlocks).

> free -m
>              total  used  free  shared  buffers  cached
> Mem:           251    59   192       0        0      48
> -/+ buffers/cache:     10   241
> Swap:           29     0    29
>
> System information:
>
> Connections to our databases come from web servers (running PHP and
> Apache) and script servers (PHP). We have verified the Apache logs
> and we didn't find any connection traffic during the same interval.
>
> Hardware information:
>
> DELL PowerEdge R715
> Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
> Ubuntu 12.04.4 LTS
> kernel: 3.8.0-35-generic 64 bit
> PostgreSQL version: 9.0.13
> RAM: 256 GB, 32 CPU cores
> pg_xlog: RAID 1
> data folder: RAID 10 (6 stripes)
> read:write ratio - 85:15
>
> Pgbouncer configured on the database side (250 allowed connections)
>
> PostgreSQL configuration:
> Database size: 28GB

So you have a 28GB database with 256GB of RAM?

> Vacuum analyzed daily
> checkpoint_completion_target = 0.9
> maintenance_work_mem = 16GB
> shared_buffers = 8GB # we reduced this from 32 GB.

Good.

> max_connections = 300
> checkpoint_segments = 32
> checkpoint_timeout = 8min
>
> detailed postgresql configuration: http://pastie.org/8754957

Nothing really suspicious here, i.e. nothing I could point out as an
obvious cause. A few notes, though:

- shared_buffers: 8GB seems about right

- maintenance_work_mem = 16GB: set it to 512MB-1GB; with just a 28GB
  database there's no point in using 16GB (and I'm yet to see a
  database where a value this high actually improves anything)

- max_connections = 300: with 32 cores you can handle ~40-50
  connections tops, assuming all of them are active, so it's fine only
  if you know most of the 300 connections are idle. Notice that ~40 is
  about the number of runnable processes in the vmstat output you
  posted.

- work_mem: 256MB seems OK, but it depends on the queries you're
  executing - with 300 connections and moderately complex queries this
  may easily blow up in your face

- full_page_writes = off # we turned it off to increase performance.

  Seriously? That's nonsense, and you may easily end up with a
  corrupted database. Tweak it only if you have actual performance
  issues and if you know your storage won't cause torn pages. Also,
  this only helps with I/O problems, which is not your case. Using the
  same logic you might set 'fsync = off' to "fix" performance issues
  (don't do that!).

> checkpoint/vacuum information http://pastie.org/8754954

Seems fine. The checkpoints write ~5% of shared buffers at most, i.e.
~400MB, which should not be a big deal. The iostat log is fine too, so
checkpoints are not the issue.

> Top command o/p: http://pastie.org/8755007

Well, it seems consistent with the vmstat output - the time is spent
in user time.

> iostat o/p: http://pastie.org/8755009
> sysctl configuration: http://pastie.org/8755197
>
> We have recently upgraded the O/S kernel to fix this issue, but it
> didn't help. We also tried to modify some O/S parameters based on
> some discussions:
>
> http://www.postgresql.org/message-id/50E4AAB1.9040902@xxxxxxxxxxxxxxxx

Not sure how that's related, as the processes are spending time in
"user time". BTW, is that a NUMA machine?

> vm.dirty_background_bytes = 33554432 # I reduced this based on some forums.
> vm.dirty_bytes = 536870912

I don't think this is related to the issue, but are you sure you want
to do this?

    vm.dirty_background_bytes = 33554432
    vm.dirty_bytes = 536870912

IMHO that's way too low - it forces the processes to block on I/O
whenever there's more than 512MB of dirty data in the page cache. What
I usually do is something like

    vm.dirty_background_bytes = (size of write cache on the controller)
    vm.dirty_bytes = 4-8x dirty_background_bytes

Assuming you have a write cache with BBU, of course.
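To make that concrete - a sketch with made-up numbers, assuming a
(hypothetical) controller with 512MB of write cache; size it to your
actual hardware:

    vm.dirty_background_bytes = 536870912    # 512MB, matches the write cache
    vm.dirty_bytes = 2147483648              # 2GB, i.e. 4x dirty_background_bytes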
> vm.overcommit_memory=2

Well, so how much swap do you have? This together with
vm.overcommit_ratio = 50 (which you mention in sysctl.conf) means "use
only 50% of RAM, plus swap". Judging by the free output (which appears
to be in GB, not MB) you have ~29GB of swap, so the commit limit is
about 50% of 256GB plus 29GB, i.e. ~157GB - you've effectively thrown
away ~100GB of RAM. Not the best idea, IMHO.

Anyway, I'm not sure that's the cause, given you have a ~28GB database
which easily fits into RAM, although the rather high number of
connections may cause issues here.

> kernel.sched_migration_cost = 5000000
> kernel.sched_autogroup_enabled = 0
>
> We believe that our PostgreSQL configuration is not correct for the
> amount of memory available on the machine and needs some urgent
> tuning.

No, the configuration seems reasonable to me (with the exceptions
mentioned above).

> Could you please guide me on troubleshooting this issue?

If I had to guess, I'd expect it to be one of these issues:

(1) locking issue - spinlocks or lwlocks

    - e.g. all the sessions are trying to acquire the same lock (or a
      small number of locks), for example by updating the same row, or
      maybe it's about spinlocks (which is consistent with the high
      CPU usage)

    - lwlocks: collect snapshots of pg_locks (WHERE NOT granted) - see
      the first query sketch below

    - spinlocks: run "perf top" and check whether the CPU is consumed
      by s_lock (the spinlock acquisition function)

(2) sudden change of application behavior

    - such issues happen when the application server suddenly
      reconnects all the connections, re-executes expensive tasks,
      etc.

    - there are ~20 sessions in "authentication" state (suspicious)

    - investigate what happens at the application side - the second
      query sketch below may help spot a reconnect storm

(3) sudden change of execution plans

    - assuming the executed queries remain the same, something else
      had to change - for example the execution plans

    - also, this might be an issue with PREPARED statements, see
      http://www.postgresql.org/message-id/CAFj8pRDCingX=b42+FoMM+pk7JL63zUXc3d48OMpaqHxrhSpeA@xxxxxxxxxxxxxx

    - try to collect the "bad" execution plans and compare them to the
      plans when the database is performing normally

    - consider using log_min_duration_statement / auto_explain (a
      config sketch follows below)
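For (1), something like this should give a usable snapshot of the
ungranted locks - a rough sketch, untested; note that on 9.0
pg_stat_activity still has procpid / current_query (renamed to
pid / query in 9.2):

    -- ungranted lock requests, with the sessions waiting on them
    SELECT l.locktype, l.relation::regclass AS relation, l.mode, l.pid,
           a.usename, a.waiting, a.query_start, a.current_query
      FROM pg_locks l
      JOIN pg_stat_activity a ON a.procpid = l.pid
     WHERE NOT l.granted
     ORDER BY a.query_start;

Run it repeatedly during an incident and compare the snapshots.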
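For (2), a quick way to check for a reconnect storm - if the
application suddenly re-opened all its connections, most backends will
share (nearly) the same backend_start:

    -- distribution of connection start times
    SELECT date_trunc('minute', backend_start) AS connected_at,
           count(*) AS connections
      FROM pg_stat_activity
     GROUP BY 1
     ORDER BY 1;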
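And for (3), a postgresql.conf sketch for auto_explain - the
thresholds are just examples, pick whatever fits your workload:

    shared_preload_libraries = 'auto_explain'   # loading the module needs a restart
    custom_variable_classes = 'auto_explain'    # needed on 9.0/9.1, removed in 9.2
    auto_explain.log_min_duration = '1s'        # log plans of queries running over 1s
    auto_explain.log_analyze = off              # 'on' adds timings, but has overhead
    log_min_duration_statement = 1000           # also log the statements themselves (ms)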
regards
Tomas

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance