53.28% postmaster postgres [.] s_lock
6.22% postmaster postgres [.] 0x00000000001b4306
2.30% postmaster [kernel.kallsyms] [k] _spin_lock
2.23% postmaster postgres [.] LWLockAcquire
1.79% postmaster postgres [.] LWLockRelease
1.39% postmaster postgres [.] hash_search_with_hash_value
1.20% postmaster postgres [.] SearchCatCache
0.71% postmaster [kernel.kallsyms] [k] hrtimer_interrupt
0.56% postmaster [kernel.kallsyms] [k] tick_program_event
0.54% postmaster [kernel.kallsyms] [k] schedule
0.44% postmaster [kernel.kallsyms] [k] _spin_lock_irq
# Then zooming in on s_lock
# Annotate s_lock
...
99.04 │ test %al,%al
...
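
For those following along: s_lock is PostgreSQL's spinlock wait loop, and the
annotate output makes sense once you see the shape of it. It's a
test-and-test-and-set: waiters mostly spin on a plain read of the lock byte
and only retry the atomic exchange when the lock looks free. Below is a
minimal sketch of the pattern in C (simplified names and GCC builtins, not the
actual PostgreSQL source); the cheap "*lock != 0" peek is what compiles down
to the "test %al,%al" carrying ~99% of the samples.

typedef volatile unsigned char slock_t;

static int
tas(slock_t *lock)
{
    /* atomically set *lock to 1 and return the previous value */
    return __sync_lock_test_and_set(lock, 1);
}

void
spin_lock(slock_t *lock)
{
    /* test-and-test-and-set: peek with a plain load first, and only
     * attempt the (bus-locking) atomic exchange when it looks free */
    while (*lock != 0 || tas(lock))
        ;   /* the real s_lock adds backoff/delay logic here */
}

void
spin_unlock(slock_t *lock)
{
    __sync_lock_release(lock);   /* write 0 with release semantics */
}
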
# Zoom into postmaster(81487) thread
55.84% postmaster postgres [.] s_lock
6.52% postmaster postgres [.] 0x00000000001b4306
2.42% postmaster [kernel.kallsyms] [k] _spin_lock
2.34% postmaster postgres [.] LWLockAcquire
1.87% postmaster postgres [.] LWLockRelease
1.46% postmaster postgres [.] hash_search_with_hash_value
1.26% postmaster postgres [.] SearchCatCache
0.75% postmaster [kernel.kallsyms] [k] hrtimer_interrupt
0.59% postmaster [kernel.kallsyms] [k] tick_program_event
0.57% postmaster [kernel.kallsyms] [k] schedule
# Zoom into postgres DSO
65.75% postmaster [.] s_lock
7.68% postmaster [.] 0x00000000001b4306
2.75% postmaster [.] LWLockAcquire
2.20% postmaster [.] LWLockRelease
1.72% postmaster [.] hash_search_with_hash_value
1.49% postmaster [.] SearchCatCache
0.54% postmaster [.] _bt_compare
0.51% postmaster [.] _bt_checkkeys
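
Note who s_lock's neighbors are in that profile: LWLockAcquire and
LWLockRelease. In 9.3 (and earlier), each lightweight lock guards its own
state with a per-lock spinlock, so a pile-up on one hot LWLock surfaces in
perf as time burned inside s_lock. A rough sketch of that relationship,
building on the spin_lock sketch above (field and function names are
illustrative, not the actual source):

#include <stdbool.h>

/* reuses slock_t, spin_lock(), spin_unlock() from the sketch above */

typedef struct LWLock
{
    slock_t mutex;      /* spinlock guarding the counters below */
    char    exclusive;  /* 1 if held exclusively */
    int     shared;     /* number of shared holders */
} LWLock;

bool
lwlock_acquire_shared(LWLock *lock)
{
    /* every attempt, successful or not, takes the per-lock spinlock
     * first -- which is why a contended LWLock shows up as s_lock time */
    spin_lock(&lock->mutex);
    if (!lock->exclusive)
    {
        lock->shared++;
        spin_unlock(&lock->mutex);
        return true;
    }
    spin_unlock(&lock->mutex);
    return false;   /* the real code queues the backend and sleeps */
}

A call graph (perf record -g, then perf report) will show which LWLock
callers dominate.
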
On Thu, Jun 19, 2014 at 5:12 PM, Borislav Ivanov <bivanov@xxxxxxxxxxxxx> wrote:
> However, most people on our team think that the number of connections is
> purely a symptom of the actual problem. We would love to be wrong about
> this. But for now we feel the high number of connections contributes to
> preserving the problem, but it's not actually triggering it.
This is entirely correct. pgbouncer is not about preventing database load
but about limiting damage when it occurs. This is generally necessary in
environments where application servers keep piling on connections when
the database is not clearing queries fast enough.
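
For illustration, a minimal pgbouncer.ini along those lines (the database
name, path, and numbers are made up; tune them to your workload):

[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
; accept the pile-up from the app servers...
max_client_conn = 2000
; ...but let only this many connections actually reach postgres
default_pool_size = 32

With pool_mode = transaction, a couple thousand client connections multiplex
over a few dozen server connections, so a stampede queues inside the pooler
instead of turning into hundreds of busy backends fighting over spinlocks.
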
In your case user% is dominating the load. Combined with the high
context-switch (cs) count, that strongly suggests spinlock contention;
running 'perf top' is essential for identifying the culprit. It's very
possible that 9.4 will fix your problem; see:
http://postgresql.1045698.n5.nabble.com/Cpu-usage-100-on-slave-s-lock-problem-td5768655.html
There was some poorly optimized code in the WAL replay.
merlin