Re: pg_stat_activity showing non-existent processes

"Kevin Grittner" <Kevin.Grittner@xxxxxxxxxxxx> · Mon, 03 Apr 2006 17:57:53 -0500

>>> On Mon, Apr 3, 2006 at 11:52 am, in message
<14779.1144083156@xxxxxxxxxxxxx>,
Tom Lane <tgl@xxxxxxxxxxxxx> wrote: 
> "Kevin Grittner" <Kevin.Grittner@xxxxxxxxxxxx> writes:
>> Is there any way to tweak this in favor of more accurate information,
>> even if it has a performance cost?  We're finding that during normal
>> operations we're not seeing most connections added to the
>> pg_stat_activity table.  We would like to be able to count on accurate
>> information there.
> 
> That's basically a non-starter because of the delay in reporting from
> the stats collector process (ie, even if the information was "completely
> accurate" it'd still be stale by the time that your code gets its hands
> on it).  I think you'd be talking about a complete redesign of the stats
> subsystem to be able to use it that way.

We want this for our monitoring software, to raise an alert when the
connection pool diverges from its nominal configuration beyond
prescribed limits or in excess of a prescribed duration.  What we're
looking for is not necessarily a table which is accurate immediately,
but one which won't entirely miss a connection.  Even then, if it only
misbehaves under extreme load, that would be OK; such extreme usage
might be worthy of note in and of itself.
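
To make the rule concrete, here is a minimal sketch of the check we have in mind. All names are hypothetical and none of this is existing PostgreSQL or monitoring code. I am assuming the observed count comes from something like a periodic count of rows in pg_stat_activity, and I am reading "beyond prescribed limits or in excess of a prescribed duration" as: alert at once when the deviation exceeds a hard limit, or when any deviation persists longer than a time limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical monitor state; these names are illustrative only. */
typedef struct PoolMonitor
{
    int  nominal;       /* configured number of pooled connections */
    int  hard_limit;    /* deviation above this alerts immediately */
    long max_secs;      /* how long any deviation may persist */
    bool diverged;      /* true while the pool is off nominal */
    long diverged_at;   /* time (seconds) the current divergence began */
} PoolMonitor;

static void
pool_monitor_init(PoolMonitor *m, int nominal, int hard_limit, long max_secs)
{
    assert(m != NULL && nominal >= 0 && hard_limit >= 0 && max_secs >= 0);
    m->nominal = nominal;
    m->hard_limit = hard_limit;
    m->max_secs = max_secs;
    m->diverged = false;
    m->diverged_at = 0;
}

/*
 * Feed one sample of the observed connection count taken at now_secs.
 * Returns true when an alert should be raised.
 */
static bool
pool_monitor_sample(PoolMonitor *m, long now_secs, int observed)
{
    int deviation = abs(observed - m->nominal);

    if (deviation == 0)
    {
        /* back at nominal: forget any divergence in progress */
        m->diverged = false;
        return false;
    }

    if (!m->diverged)
    {
        /* first sample of a new divergence: remember when it started */
        m->diverged = true;
        m->diverged_at = now_secs;
    }

    if (deviation > m->hard_limit)
        return true;    /* too far off nominal, regardless of duration */

    /* within the hard limit, but has it lasted too long? */
    return (now_secs - m->diverged_at) > m->max_secs;
}
```

A check like this only works if a sample never entirely misses a connection. A stale count merely delays the alert, while a dropped one looks like a real divergence.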

Since we have converted to PostgreSQL we have not had this monitoring,
and folks are nervous that we will not detect a struggling middle tier
before it fails.  (Not something that happens often, but we really hate
having users tell us that something is broken, versus spotting the
impending failure and correcting it before it fails.)

> Having said that, though, I'd be pretty surprised if the stats subsystem
> was dropping more than a small fraction of messages --- I would think
> that could only occur under very heavy load, and if that's your normal
> operating state then it's time to upgrade your hardware ;-).

We have a pair of database servers for our transaction repository.
Each has four Xeon processors.  One runs Windows, the other Linux.
On the Windows machine, I see 10% CPU utilization.  On the Linux machine
I see a load average of 0.30.  The Linux machine seems to be very
reliable about showing the connections.  On the Windows machine, when I
refresh a 20-connection pool, I see either no connections at all or
only a few.

> Maybe you should investigate a bit more closely to find out why it's
> dropping so much.

It is probably related to something we've been seeing in the PostgreSQL
logs on the Windows servers:

[2006-04-03 08:28:25.990 ] 2072 FATAL:  could not read from statistics collector pipe: No error
[2006-04-03 08:28:26.068 ] 2012 LOG:  statistics collector process (PID 3268) was terminated by signal 1

We're going to patch to try to capture more info from WinSock.

In src/port/pipe.c we plan to add the following just before the
"return ret" in piperead():

if (ret == SOCKET_ERROR)
{
       /* log the WinSock error code behind the failed read */
       ereport(LOG, (errmsg_internal("SOCKET ERROR: %d",
                                     WSAGetLastError())));
}
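
The value this logs is a bare WinSock error number, so we will probably want a small lookup when reading the logs. The following is only a sketch of such a helper: wsa_error_name() and its table are hypothetical, not part of PostgreSQL or of the planned patch, and the table covers just a handful of the documented Windows Sockets codes. The numbers are written out as plain integers so the helper compiles without the Windows headers.

```c
#include <stddef.h>

/* Hypothetical lookup entry: numeric WinSock code and its symbolic name. */
typedef struct WsaErrorName
{
    int         code;
    const char *name;
} WsaErrorName;

/* A few documented Windows Sockets error codes; deliberately incomplete. */
static const WsaErrorName wsa_error_names[] = {
    {10004, "WSAEINTR"},
    {10035, "WSAEWOULDBLOCK"},
    {10038, "WSAENOTSOCK"},
    {10054, "WSAECONNRESET"},
    {10060, "WSAETIMEDOUT"},
};

/* Return the symbolic name for a code, or "unknown" if it is not listed. */
static const char *
wsa_error_name(int code)
{
    size_t i;
    size_t n = sizeof(wsa_error_names) / sizeof(wsa_error_names[0]);

    for (i = 0; i < n; i++)
    {
        if (wsa_error_names[i].code == code)
            return wsa_error_names[i].name;
    }
    return "unknown";
}
```

For example, a logged code of 10054 would map to WSAECONNRESET. Anything not in the table comes back as "unknown" and would have to be looked up by hand.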

I hope to post more info, and possibly a patch, tomorrow.

-Kevin