On 09/10/2013 11:04 AM, David Whittaker wrote:
Hi All,
I've been seeing a strange issue with our Postgres install for about a
year now, and I was hoping someone might be able to help point me at
the cause. At what seem like fairly random intervals Postgres will
become unresponsive to the 3 application nodes it services. These
periods tend to last for 10 - 15 minutes before everything rights
itself and the system goes back to normal.
During these periods the server will report a spike in outbound
bandwidth (from about 1 Mb/s to about 5 Mb/s most recently), a huge spike
in context switches / interrupts (normal peaks are around 2k/8k
respectively, and during these periods they've gone to 15k/22k), and a
load average of 100+. CPU usage stays relatively low, but it's all
reported as system time; user time goes to zero. It doesn't seem to be
disk related, since we're running with a shared_buffers setting of 24GB,
which fits just about our entire database into memory, and the IO
transactions reported by the server, as well as the disk reads
reported by Postgres, stay consistently low.
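In case it's useful, those spikes are easy to watch live with the
standard tools (sar assumes the sysstat package is installed):

# "cs" = context switches/s, "in" = interrupts/s, "sy"/"us" = system/user CPU.
# During an episode, cs goes from ~2k to ~15k and in from ~8k to ~22k.
vmstat 5

# sar can record the same counters for later review:
sar -w 5       # context switches per second (cswch/s)
sar -I SUM 5   # total interrupts per second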
We've recently started tracking how long statements take to execute,
and we're seeing some really odd numbers. A simple delete by primary
key, for example, from a table that contains about 280,000 rows,
reportedly took 18h59m46.900s. An update by primary key in that same
table was reported as 7d 17h 58m 30.415s. That table is frequently
accessed, but obviously those numbers don't seem reasonable at all.
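When it's actually happening, something like this should show what the
backends are doing and how long their current statements have been
running (the column names assume PostgreSQL 9.2+; older releases use
procpid and current_query instead of pid and query):

# Longest-running active statements right now
psql -c "SELECT pid, now() - query_start AS runtime, waiting, query
         FROM pg_stat_activity
         WHERE state <> 'idle'
         ORDER BY runtime DESC
         LIMIT 20;"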
Some other changes we've made to postgresql.conf:
synchronous_commit = off
maintenance_work_mem = 1GB
wal_level = hot_standby
wal_buffers = 16MB
max_wal_senders = 10
wal_keep_segments = 5000
checkpoint_segments = 128
checkpoint_timeout = 30min
checkpoint_completion_target = 0.9
max_connections = 500
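The values actually in effect for the running server can be
double-checked against pg_settings, e.g.:

# Confirm a few of the settings above are live, not just in the file
psql -c "SELECT name, setting, unit, source FROM pg_settings
         WHERE name IN ('shared_buffers', 'checkpoint_segments',
                        'max_connections', 'synchronous_commit');"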
The server is a Dell PowerEdge R900 with 4 Xeon E7430 processors and
48GB of RAM, running CentOS 6.3.
So far we've tried disabling Transparent Huge Pages after I found a
number of resources online that linked it to similar interrupt/context
switch issues, but that hasn't resolved the problem.
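For reference, disabling THP on CentOS 6 generally looks like this
(RHEL/CentOS 6 kernels use a redhat_ prefix in the sysfs path; mainline
kernels use /sys/kernel/mm/transparent_hugepage instead):

# Check the current mode; the active setting is shown in brackets.
cat /sys/kernel/mm/redhat_transparent_hugepage/enabled

# Disable THP and its defrag behavior (as root). This doesn't persist
# across reboots, so it also needs to go in rc.local or on the kernel
# command line.
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag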
I managed to catch it happening once and ran perf, which showed:
+ 41.40% 48154 postmaster 0x347ba9 f 0x347ba9
+ 9.55% 10956 postmaster 0x2dc820 f set_config_option
+ 8.64% 9946 postmaster 0x5a3d4 f writeListPage
+ 5.75% 6609 postmaster 0x5a2b0 f ginHeapTupleFastCollect
+ 2.68% 3084 postmaster 0x192483 f build_implied_join_equality
+ 2.61% 2990 postmaster 0x187a55 f build_paths_for_OR
+ 1.86% 2131 postmaster 0x794aa f get_collation_oid
+ 1.56% 1822 postmaster 0x5a67e f ginHeapTupleFastInsert
+ 1.53% 1766 postmaster 0x1929bc f distribute_qual_to_rels
+ 1.33% 1558 postmaster 0x249671 f cmp_numerics
I'm not sure what 0x347ba9 represents, or why it's an address rather
than a function name.
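My guess is that perf couldn't resolve the address because debug
symbols aren't installed for our build. If that's the case, something
like this should turn the address into a name next time it happens
(debuginfo-install comes with yum-utils; the package name depends on
how PostgreSQL was installed):

# Pull in debugging symbols for the installed PostgreSQL package
debuginfo-install postgresql-server

# Capture a fresh profile during an episode and view it with symbols
perf record -a -g -- sleep 30
perf report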
That's about the sum of it. Any help would be greatly appreciated and
if you want any more information about our setup, please feel free to ask.
I have seen cases like this with very high shared_buffers settings.
24GB for shared_buffers is quite high, especially on a 48GB box. What
happens if you dial that back to, say, 12GB?
cheers
andrew