On 07/03/2012 12:34 AM, Craig Ringer wrote:
On 07/03/2012 03:19 PM, Pawel Veselov wrote:
Hi.
-- problem 1 --
I have an application, using libpq, connecting to postgres
9.1.3 (Amazon AMI distro).
The application writes data at a high rate (at this point
it's 500 transactions per second), using multiple threads (at
this point it's 800).
These are "worker" threads, that receive "messages" that
are then written out to the DB. There is no connection pool;
instead, each worker thread maintains its own connection that
it uses to write data to the database. The connections are
kept in pthread's "specific" data blocks.
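For readers unfamiliar with the pattern described above, here is a minimal sketch of keeping one connection per thread in pthread-specific data. It is not the poster's code: `worker_conn`, `get_thread_conn` and `conn_destroy` are hypothetical names, and the handle is a plain struct standing in for the `PGconn *` that a libpq program would obtain from `PQconnectdb()` and release with `PQfinish()`.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a database connection handle.
 * In a libpq program this would be a PGconn * from PQconnectdb(). */
typedef struct worker_conn {
    int id;
} worker_conn;

static pthread_key_t conn_key;
static pthread_once_t conn_key_once = PTHREAD_ONCE_INIT;

/* Called automatically at thread exit if the thread stored a non-NULL
 * value under conn_key; a libpq program would call PQfinish() here. */
static void conn_destroy(void *p)
{
    free(p);
}

/* Create the key exactly once, no matter how many threads race here. */
static void conn_key_init(void)
{
    pthread_key_create(&conn_key, conn_destroy);
}

/* Return the calling thread's connection, opening it on first use. */
static worker_conn *get_thread_conn(void)
{
    pthread_once(&conn_key_once, conn_key_init);

    worker_conn *c = pthread_getspecific(conn_key);
    if (c == NULL) {
        c = calloc(1, sizeof *c);   /* placeholder for PQconnectdb() */
        if (c == NULL)
            return NULL;
        if (pthread_setspecific(conn_key, c) != 0) {
            free(c);
            return NULL;
        }
    }
    return c;
}
```

With this layout, 800 worker threads mean 800 open server connections for as long as the threads live, which is why the server-side connection limit matters in the rest of the thread.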
[skipped, replied to separately]
Can't connect to DB: could not send data to server:
Transport endpoint is not connected
could not send startup packet: Transport endpoint is not
connected
postmaster forking and failing because of operating system
resource limits like max proc count, anti-forkbomb measures, max
file handles, etc?
If accept() succeeded, and fork() failed, the socket would be closed
by the process (parent will close, child socket wouldn't even be
forked), wouldn't that result in ECONNRESET, and not ENOTCONN?
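As background for the error-code question: on Linux, "Transport endpoint is not connected" is the message text for ENOTCONN, while ECONNRESET reads "Connection reset by peer". The sketch below shows only the simplest way to observe ENOTCONN, namely asking for the peer of a socket that was never connected. It does not reproduce the libpq failure discussed here, and `errno_for_unconnected_socket` is a hypothetical helper name.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a stream socket, never connect it, and ask for its peer.
 * Returns the errno reported by getpeername(), -1 if the socket could
 * not be created, or 0 if getpeername() unexpectedly succeeded.
 * AF_UNIX is used only because it needs no network setup. */
static int errno_for_unconnected_socket(void)
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_storage peer;
    socklen_t len = sizeof peer;
    int err = 0;
    if (getpeername(fd, (struct sockaddr *)&peer, &len) < 0)
        err = errno;   /* ENOTCONN is expected: there is no peer */

    close(fd);
    return err;
}
```

The point of the distinction is that ENOTCONN describes the local socket's state (no established connection), whereas ECONNRESET reports that an established connection was torn down by the other side.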
-- problem 2 --
As I'm trying to debug this (with strace), I could never
reproduce it, at least to see what's going on, but sometimes I
get another error: "too many users connected". Even
restarting postmaster doesn't help. The postmaster is running
with -N810, and the role has connection limit of 1000. Yet,
the "too many" error starts creeping up only after 275
connections are opened (counted by successful connect() from
strace).
Any idea where I should dig?
See how many connections the *server* thinks exist by examining
pg_stat_activity.
Check dmesg and the PostgreSQL server logs to see if you're
hitting operating system limits. Look for fork() failures,
unexplained segfaults, etc.
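One way to look at the per-process limits mentioned above (max process count and max open files) is getrlimit(). This is a rough sketch with hypothetical helper names (`read_limit`, `print_process_limits`); it reports the limits of whichever process runs it, so it is only meaningful when run as the same user and from the same environment that starts the postmaster. RLIMIT_NPROC is available on Linux but is not part of POSIX.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Fetch the soft and hard limit for one resource; returns 0 on success. */
static int read_limit(int resource, rlim_t *soft, rlim_t *hard)
{
    struct rlimit rl;
    if (getrlimit(resource, &rl) != 0)
        return -1;
    *soft = rl.rlim_cur;
    *hard = rl.rlim_max;
    return 0;
}

/* Print one value, showing "unlimited" for RLIM_INFINITY. */
static void print_value(rlim_t v)
{
    if (v == RLIM_INFINITY)
        printf("unlimited");
    else
        printf("%llu", (unsigned long long)v);
}

/* Print the two limits most relevant to a server that forks one
 * backend process per client connection. */
static void print_process_limits(void)
{
    static const struct { const char *name; int resource; } wanted[] = {
        { "max processes (RLIMIT_NPROC)",   RLIMIT_NPROC  },
        { "max open files (RLIMIT_NOFILE)", RLIMIT_NOFILE },
    };

    for (size_t i = 0; i < sizeof wanted / sizeof wanted[0]; i++) {
        rlim_t soft, hard;
        if (read_limit(wanted[i].resource, &soft, &hard) != 0) {
            perror(wanted[i].name);
            continue;
        }
        printf("%s: soft=", wanted[i].name);
        print_value(soft);
        printf(" hard=");
        print_value(hard);
        printf("\n");
    }
}
```

If the soft process limit is close to the number of backends plus everything else that user runs, fork() failures become plausible even when nothing shows up in dmesg.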
That's the thing, no segfaults (dmesg), nothing in the server logs.
It may well be some sort of an anti-fork-bomb measure, only
judging by the fact that with enough attempts, things do clear out,
though I wish there would be some indication of that, and I'm still
confused about the error code being ENOTCONN.