Re: Spurious Stalls

Scott Marlowe <scott.marlowe@xxxxxxxxx> · Thu, 12 Jun 2014 17:30:07 -0600

On Thu, Jun 12, 2014 at 1:57 PM, Christopher Nielsen
<cnielsen@xxxxxxxxxxxxx> wrote:

> Lately, though, about once a day now, for about a week, we have been experiencing periods of stalling.  When Postgres stalls, we haven't been able to recover, without restarting the database, unfortunately.

Just the db and not the server right? Have you tried killing any
individual queries that seem to be problem / long running queries?

> Here is some background, about the issue.  We have found the following symptoms.
>
> During this performance issue, we found the following symptoms.
> Running queries do not return.

Even simple ones like "select 1;" ?
Or ones that are something like "select * from smalltable where id=123; ?

> The application sometimes can no longer get new connections.
> The CPU load increases
> There is no I/O wait.
> There is no swapping.

> Also, our database configuration, is attached to this email, as postgresql.conf, for reference, along with a profile of our hardware and tuning, as pg_db_profile.txt.

How much memory does your machine have? How much of that is shared buf
then? If it's got 8G and 6G shared_buffers I'd lower shared_buffers.
etc
What is your IO subsystem like? Is it a single 7200 RPM SATA drive, an
array of 16 15krpm disks under a caching raid controller? A 1TB
FusionIO card?
Just because linux SAYS there's wait doesn't mean there isn't one.
Sometimes the io wait numbers are zero while some kernel daemon uses
up all the bandwidth and you really are waiting.
Your charts show 35MB/s write. If that's all random, that's a LOT of
writes on anything but SSDs and even then it's a fair bit for a busy
db server if it's being done by 100 or more processes, which is what
this looks like.

> While the database was unavailable, we also collected a lot of data.  Looking through this info, a few things pop-out to us, that may be problematic, or useful to notice.
>
> Disk I/O appears to be all write, and little read.
> In previous incidents, with the same symptoms, we have seen pg processes spending much time in s_lock
> That info is attached to this email also, as files named perf_*.
>
> Additionally, monitoring graphs show the following performance profile.
>
> Problem
>
> As you can probably see below, at 11:54, the DB stops returning rows.
>
> Also, transactions stop returning, causing the active transaction time to trend up to the sky.
>
>
> Consequences of Problem
>
> Once transactions stop returning, we see connections pile-up.  Eventually, we reach a max, and clients can no longer connect.

You need to lower your max connections and institute connection
pooling yesterday. 500 connections and 256MB work mem are a recipe for
disaster. While a machine with a db pool in front of it can survive
such scenarios, lack of a pool and high work mem are killing your
machine.

I recommend pgbouncer. Dirt simple, fast, and will keep your incoming
connections limited to something your db can handle.