Re: Intermittent hangs with 9.2

"ktm@xxxxxxxx" <ktm@xxxxxxxx> · Tue, 10 Sep 2013 10:33:48 -0500

On Tue, Sep 10, 2013 at 11:04:21AM -0400, David Whittaker wrote:
> Hi All,
> 
> I've been seeing a strange issue with our Postgres install for about a year
> now, and I was hoping someone might be able to help point me at the cause.
> At what seem like fairly random intervals Postgres will become unresponsive
> to the 3 application nodes it services. These periods tend to last for 10 -
> 15 minutes before everything rights itself and the system goes back to
> normal.
> 
> During these periods the server will report a spike in the outbound
> bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
> context switches / interrupts (normal peaks are around 2k/8k respectively,
> and during these periods they‘ve gone to 15k/22k), and a load average of
> 100+. CPU usage stays relatively low, but it’s all system time reported,
> user time goes to zero. It doesn‘t seem to be disk related since we’re
> running with a shared_buffers setting of 24G, which will fit just about our
> entire database into memory, and the IO transactions reported by the
> server, as well as the disk reads reported by Postgres stay consistently
> low.
> 
> We‘ve recently started tracking how long statements take to execute, and
> we’re seeing some really odd numbers. A simple delete by primary key, for
> example, from a table that contains about 280,000 rows, reportedly took
> 18h59m46.900s. An update by primary key in that same table was reported as
> 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
> numbers don't seem reasonable at all.
> 
> Some other changes we've made to postgresql.conf:
> 
> synchronous_commit = off
> 
> maintenance_work_mem = 1GB
> wal_level = hot_standby
> wal_buffers = 16MB
> 
> max_wal_senders = 10
> 
> wal_keep_segments = 5000
> 
> checkpoint_segments = 128
> 
> checkpoint_timeout = 30min
> 
> checkpoint_completion_target = 0.9
> 
> max_connections = 500
> 
> The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
> RAM, running Cent OS 6.3.
> 
> So far we‘ve tried disabling Transparent Huge Pages after I found a number
> of resources online that indicated similar interrupt/context switch issues,
> but it hasn’t resolve the problem. I managed to catch it happening once and
> run a perf which showed:
> 
> 
> +  41.40%       48154  postmaster  0x347ba9         f 0x347ba9
> +   9.55%       10956  postmaster  0x2dc820         f
> set_config_option
> +   8.64%        9946  postmaster  0x5a3d4          f writeListPage
> +   5.75%        6609  postmaster  0x5a2b0          f
> ginHeapTupleFastCollect
> +   2.68%        3084  postmaster  0x192483         f
> build_implied_join_equality
> +   2.61%        2990  postmaster  0x187a55         f
> build_paths_for_OR
> +   1.86%        2131  postmaster  0x794aa          f
> get_collation_oid
> +   1.56%        1822  postmaster  0x5a67e          f
> ginHeapTupleFastInsert
> +   1.53%        1766  postmaster  0x1929bc         f
> distribute_qual_to_rels
> +   1.33%        1558  postmaster  0x249671         f cmp_numerics
> 
> I‘m not sure what 0x347ba9 represents, or why it’s an address rather than a
> method name.
> 
> That's about the sum of it. Any help would be greatly appreciated and if
> you want any more information about our setup, please feel free to ask.
> 
> Thanks,
> Dave

Hi Dave,

A load average of 100+ means that you have that many processes waiting to
run yet you only have 16 cpus. You really need to consider using a connection
pooler like pgbouncer to keep your connection count in the 16-32 range.

Regards,
Ken

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance