Search Postgresql Archives

Spurious Stalls

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




Hi Group,

My team has been very happy using Postgres, hosting Bitbucket.  Thanks very much for all the community contributions, to the platform.

Lately, though, about once a day now, for about a week, we have been experiencing periods of stalling.  When Postgres stalls, we haven't been able to recover, without restarting the database, unfortunately.

This brings our uptime down some, that we'd like to avoid (99.2%) :(  We'd like to do a better job keeping things running.

It would be great to get your input about it.  Alternately, if someone is available, as a consultant, that would be great too.

Here is some background, about the issue.  We have found the following symptoms.
  • During this performance issue, we found the following symptoms.
  • Running queries do not return.
  • The application sometimes can no longer get new connections.
  • The CPU load increases
  • There is no I/O wait.
  • There is no swapping.
Also, our database configuration, is attached to this email, as postgresql.conf, for reference, along with a profile of our hardware and tuning, as pg_db_profile.txt.

While the database was unavailable, we also collected a lot of data.  Looking through this info, a few things pop-out to us, that may be problematic, or useful to notice.
  • Disk I/O appears to be all write, and little read.
  • In previous incidents, with the same symptoms, we have seen pg processes spending much time in s_lock
  • That info is attached to this email also, as files named perf_*.
Additionally, monitoring graphs show the following performance profile.

Problem

As you can probably see below, at 11:54, the DB stops returning rows.

Also, transactions stop returning, causing the active transaction time to trend up to the sky.

Consequences of Problem

Once transactions stop returning, we see connections pile-up.  Eventually, we reach a max, and clients can no longer connect.

The cpu utilization increases to nearly 100%, in user space, and stays there, until the database is restarted.

Events Before Problem

This is likely the most useful part.  As the time approaches 11:54, there are periods of increased latency.  There is also a marked increase in write operations, in general.
Lastly, about 10 minutes before outage, postgres writes a sustained 30 MB/s of temp files.


After investigating this, we found a query that was greatly exceeding work_mem.  We've since optimized it, and hopefully, that will have a positive effect on the above.

We may not know until the next issue happens, though.

With a problem like this, I am not exactly positive how to proceed.  I am really looking forward to hearing your thoughts, and opinions, if you can share them.

Thanks very much,

-Chris

A description of what you are trying to achieve and what results you expect.

    We were trying to handle our normal production load.
    Spuriously, postgres becomes unresponsive, though.

The EXACT PostgreSQL version you are running.

    postgres (PostgreSQL) 9.2.6

Postgres Installation:

    Postgres is installed from the "PostgreSQL RPM Building Project".

    YUM Repo:       http://yum.postgresql.org/9.2/redhat/rhel-6-x86_64/
    Arch / Version: CentOS 6 - x86_64
    
Changes made to the settings in the postgresql.conf file:

    postgres=# SELECT version();
                                                       version
    --------------------------------------------------------------------------------------------------------------
     PostgreSQL 9.2.6 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-3), 64-bit
    (1 row)

    postgres=# SELECT name, current_setting(name), source
    postgres-#   FROM pg_settings
    postgres-#   WHERE source NOT IN ('default', 'override');
                 name             |                 current_setting                  |        source
    ------------------------------+--------------------------------------------------+----------------------
     archive_command              | rsync -a %p 172.17.10.10:/backups/pg_archives/%f | configuration file
     archive_mode                 | on                                               | configuration file
     autovacuum                   | on                                               | configuration file
     autovacuum_freeze_max_age    | 1000000000                                       | configuration file
     autovacuum_max_workers       | 6                                                | configuration file
     checkpoint_completion_target | 0.9                                              | configuration file
     checkpoint_segments          | 256                                              | configuration file
     checkpoint_timeout           | 30min                                            | configuration file
     DateStyle                    | ISO, MDY                                         | configuration file
     default_text_search_config   | pg_catalog.english                               | configuration file
     effective_cache_size         | 176GB                                            | configuration file
     effective_io_concurrency     | 10                                               | configuration file
     lc_messages                  | en_US.UTF-8                                      | configuration file
     lc_monetary                  | en_US.UTF-8                                      | configuration file
     lc_numeric                   | en_US.UTF-8                                      | configuration file
     lc_time                      | en_US.UTF-8                                      | configuration file
     listen_addresses             | *                                                | configuration file
     log_autovacuum_min_duration  | 0                                                | configuration file
     log_checkpoints              | on                                               | configuration file
     log_destination              | stderr                                           | configuration file
     log_directory                | /var/log/postgres                                | configuration file
     log_filename                 | postgresql-%Y-%m-%d_%H%M%S.log                   | configuration file
     log_line_prefix              | %t [%p]: [%l-1]                                  | configuration file
     log_lock_waits               | on                                               | configuration file
     log_min_duration_statement   | 500ms                                            | configuration file
     log_rotation_age             | 1d                                               | configuration file
     log_rotation_size            | 0                                                | configuration file
     log_temp_files               | 0                                                | configuration file
     log_truncate_on_rotation     | on                                               | configuration file
     logging_collector            | on                                               | configuration file
     maintenance_work_mem         | 1GB                                              | configuration file
     max_connections              | 500                                              | configuration file
     max_stack_depth              | 2MB                                              | environment variable
     max_wal_senders              | 10                                               | configuration file
     pg_stat_statements.max       | 10000                                            | configuration file
     pg_stat_statements.track     | all                                              | configuration file
     port                         | 5432                                             | command line
     random_page_cost             | 2                                                | configuration file
     shared_buffers               | 6GB                                              | configuration file
     shared_preload_libraries     | pg_stat_statements                               | configuration file
     track_activities             | on                                               | configuration file
     track_counts                 | on                                               | configuration file
     vacuum_freeze_min_age        | 10000                                            | configuration file
     vacuum_freeze_table_age      | 500000000                                        | configuration file
     wal_buffers                  | 4MB                                              | configuration file
     wal_keep_segments            | 1536                                             | configuration file
     wal_level                    | hot_standby                                      | configuration file
     work_mem                     | 256MB                                            | configuration file
    (48 rows)


Operating system and version

    $ uname -a
    Linux db01 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

    $ cat /etc/redhat-release
    CentOS release 6.5 (Final)

For questions about any kind of error:
    What you were doing when the error happened / how to cause the error.
    The EXACT TEXT of the error message you're getting if there is one. Copy and paste the message to the email, do not send a screenshot.

What program you're using to connect to PostgreSQL

    The python library psycopg2 version 2.5.2

If you're using a connection pool, load balancer or application server, which one you're using and its version

    We are using 18 instances of pgbouncer.
    pgbouncer version 1.5.4 (compiled by <postgres@koji-sl6-x86-64-pg92> at 2012-12-10 11:22:04)

Is there anything remotely unusual in the PostgreSQL server logs?

    Preceeding crash:
        2014-06-10 22:53:51 GMT [52621]: [1-1] ERROR:  relation "pg_buffercache" does not exist at character 22
        2014-06-10 22:48:28 GMT [48822]: [1-1] ERROR:  syntax error at or near "pg_sleep" at character 1
        2014-06-10 21:39:05 GMT [80406]: [9-1] ERROR:  syntax error in tsquery: "[Route()]:*AB "
        2014-06-10 21:21:23 GMT [12348]: [2-1] DETAIL:  Key (user_id)=(2743403) already exists.
        2014-06-10 21:21:23 GMT [12348]: [1-1] ERROR:  duplicate key value violates unique constraint "google_googleprofile_user_id_key"

    Preceeding crash:
        2014-06-10 03:33:12 GMT [18339]: [1-1] FATAL:  remaining connection slots are reserved for non-replication superuser connections
        2014-06-10 03:32:14 GMT [17907]: [7-1] FATAL:  connection to client lost
        2014-06-10 03:32:10 GMT [17964]: [1-1] FATAL:  sorry, too many clients already
        2014-06-10 00:58:46 GMT [38934]: [8-1] HINT:  No function matches the given name and argument types. You might need to add explicit type cast


CPU manufacturer and model

    80 Cores

    vendor_id       : GenuineIntel
    model name      : Intel(R) Xeon(R) CPU E7- 4850  @ 2.00GHz
    cpu MHz         : 1995.010
    cache size      : 24576 KB
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt aes lahf_lm ida arat dts tpr_shadow vnmi flexpriority ept vpid
    bogomips        : 3990.02

Amount and size of RAM installed, eg "2GB RAM"

    250 GB RAM

Storage details (important for performance and corruption questions)

    Do you use a RAID controller?
    
        Yes.

    If so, what type of controller?

        Product Name:          PERC H700 Integrated
        Memory:                1024 MB
        BBU:                   Present
        Number of DISK GROUPS: 4
        Chipset:               LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)


    Does it have a battery backed cache module?
    
        Yes.

    Is write-back caching enabled?

        Yes.

    Do you use software RAID?

        No.

Is your PostgreSQL database on a SAN?

    No.

How many hard disks are connected to the system and what types are they?

     2x - Dell SSD - 46.5 GB
    14x - SEAGATE ST9146853SS - 146GB 15000 RPM 64MB Cache SAS 6Gb/s
      
How are your disks arranged for storage?

    Are you using RAID?

        Yes
    
    If so, what RAID level(s)?

        Virtual Drive Id  Drive Type  Drive Name       Size       RAID Type    Num Disks   Cache Policy
        ------------------------------------------------------------------------------------------------------------------
                       0  Virtual     Virtual Disk 0   136.1 GB   1 - Mirror           2   WriteBack, ReadAdaptive, Direct
                       1  CacheCade   CacheCade         92.0 GB   0 - Stripe           2   WriteBack, ReadAdaptive, Direct
                       2  Virtual     xlog             136.1 GB   1 - Mirror           2   WriteBack, ReadAdaptive, Direct
                       3  Virtual     data             680.6 GB   1 - Mirror          10   WriteBack, ReadAdaptive, Direct
    
    What PostgreSQL data is on what disks / disk sets?

        Drive Name       Mount Point      Filesystem  Postgres Use   Disk Distribution
        ----------------------------------------------------------------------------------
        Virtual Disk 0   /boot, /         ext3        None            2x - 146GB 15000 SAS
        xlog             /pg_xlog         ext3        xlog            2x - 146GB 15000 SAS
        data             /var/lib/pgsql   ext3        data           10x - 146GB 15000 SAS

        Device       Mount Point      Options
        ----------------------------------------------------------
        /dev/sda1 on /boot            ext4 (rw)
        /dev/sdb1 on /pg_xlog         ext4 (rw,noatime,nodiratime)
        /dev/sdc1 on /var/lib/pgsql   ext4 (rw,noatime,nodiratime)

Attachment: postgresql.conf
Description: Binary data

Attachment: perf_example_vmstat
Description: Binary data

Attachment: perf_example_dmesg
Description: Binary data

Attachment: perf_example_ipcs
Description: Binary data

blocked_pid,blocked_user,blocking_pid,blocking_user,blocked_statement

Attachment: perf_example_pginfo
Description: Binary data

Attachment: perf_example_ps_auxfww
Description: Binary data

Attachment: perf_example_iotop
Description: Binary data

Attachment: perf_example_strace.47700
Description: Binary data

Attachment: perf_example_backtrace.47700
Description: Binary data

Attachment: perf_example_stack.47700
Description: Binary data

Attachment: perf_example_status.47700
Description: Binary data

Attachment: perf_example_strace.46462
Description: Binary data

Attachment: perf_example_syscall.47700
Description: Binary data

Attachment: perf_example_backtrace.46462
Description: Binary data

Attachment: perf_example_stack.46462
Description: Binary data

Attachment: perf_example_status.46462
Description: Binary data

Attachment: perf_example_strace.29561
Description: Binary data

Attachment: perf_example_syscall.46462
Description: Binary data

Attachment: perf_example_backtrace.29561
Description: Binary data

Attachment: perf_example_stack.29561
Description: Binary data

Attachment: perf_example_status.29561
Description: Binary data

Attachment: perf_example_syscall.29561
Description: Binary data

Attachment: perf_example_strace.81372
Description: Binary data

Attachment: perf_example_backtrace.81372
Description: Binary data

Attachment: perf_example_stack.81372
Description: Binary data

Attachment: perf_example_status.81372
Description: Binary data

Attachment: perf_example_syscall.81372
Description: Binary data

Attachment: perf_example_vacuum
Description: Binary data


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux