On Wed, Sep 11, 2013 at 12:17 PM, Andres Freund <andres@xxxxxxxxxxxxxxx> wrote: > On 2013-09-11 07:43:35 -0500, Merlin Moncure wrote: >> > I've been seeing a strange issue with our Postgres install for about a year >> > now, and I was hoping someone might be able to help point me at the cause. >> > At what seem like fairly random intervals Postgres will become unresponsive >> > to the 3 application nodes it services. These periods tend to last for 10 - >> > 15 minutes before everything rights itself and the system goes back to >> > normal. >> > >> > During these periods the server will report a spike in the outbound >> > bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in >> > context switches / interrupts (normal peaks are around 2k/8k respectively, >> > and during these periods they‘ve gone to 15k/22k), and a load average of >> > 100+. CPU usage stays relatively low, but it’s all system time reported, >> > user time goes to zero. It doesn‘t seem to be disk related since we’re >> > running with a shared_buffers setting of 24G, which will fit just about our >> > entire database into memory, and the IO transactions reported by the server, >> > as well as the disk reads reported by Postgres stay consistently low. >> > >> > We‘ve recently started tracking how long statements take to execute, and >> > we’re seeing some really odd numbers. A simple delete by primary key, for >> > example, from a table that contains about 280,000 rows, reportedly took >> > 18h59m46.900s. An update by primary key in that same table was reported as >> > 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those >> > numbers don't seem reasonable at all. >> > >> > Some other changes we've made to postgresql.conf: >> > >> > synchronous_commit = off >> > >> > maintenance_work_mem = 1GB >> > wal_level = hot_standby >> > wal_buffers = 16MB >> > >> > max_wal_senders = 10 >> > >> > wal_keep_segments = 5000 >> > >> > checkpoint_segments = 128 >> > >> > checkpoint_timeout = 30min >> > >> > checkpoint_completion_target = 0.9 >> > >> > max_connections = 500 >> > >> > The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of >> > RAM, running Cent OS 6.3. >> > >> > So far we‘ve tried disabling Transparent Huge Pages after I found a number >> > of resources online that indicated similar interrupt/context switch issues, >> > but it hasn’t resolve the problem. I managed to catch it happening once and >> > run a perf which showed: >> > >> > + 41.40% 48154 postmaster 0x347ba9 f 0x347ba9 >> > + 9.55% 10956 postmaster 0x2dc820 f set_config_option >> > + 8.64% 9946 postmaster 0x5a3d4 f writeListPage >> > + 5.75% 6609 postmaster 0x5a2b0 f >> > ginHeapTupleFastCollect >> > + 2.68% 3084 postmaster 0x192483 f >> > build_implied_join_equality >> > + 2.61% 2990 postmaster 0x187a55 f build_paths_for_OR >> > + 1.86% 2131 postmaster 0x794aa f get_collation_oid >> > + 1.56% 1822 postmaster 0x5a67e f ginHeapTupleFastInsert >> > + 1.53% 1766 postmaster 0x1929bc f >> > distribute_qual_to_rels >> > + 1.33% 1558 postmaster 0x249671 f cmp_numerics >> > >> > I‘m not sure what 0x347ba9 represents, or why it’s an address rather than a >> > method name. > > Try converting it to something more meaningful with "addr2line", that > often has more sucess. > >> > That's about the sum of it. Any help would be greatly appreciated and if you >> > want any more information about our setup, please feel free to ask. > >> Reducing shared buffers to around 2gb will probably make the problem go away > > That profile doesn't really look like one of the problem you are > referring to would look like. yup -- I think you're right. merlin -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance