errors with high connections rate

Pawel Veselov <pawel.veselov@xxxxxxxxx> · Tue, 3 Jul 2012 00:19:24 -0700

Hi.

-- problem 1 --

I have an application, using libpq, connecting to postgres 9.1.3 (Amazon AMI distro).
The application writes data at a high rate (currently 500 transactions per second) using multiple threads (currently 800).

These are "worker" threads that receive "messages", which are then written out to the DB. There is no connection pool; instead, each worker thread maintains its own connection, which it uses to write data to the database. The connections are kept in pthread thread-specific data blocks.
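
For illustration, a minimal sketch of the per-thread storage described above. This is not the application's actual code: the names worker_info, wi_key and get_worker_info are made up for the example, and pg_conn is declared as void * as a stand-in for PGconn * so that the snippet compiles without libpq.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread state; the real struct would hold a PGconn *. */
typedef struct worker_info {
    void *pg_conn;  /* stand-in for PGconn *, NULL until first connect */
    int   error;    /* set when a statement fails; forces a reconnect */
} worker_info;

static pthread_key_t  wi_key;
static pthread_once_t wi_once = PTHREAD_ONCE_INIT;

/* Runs at thread exit; real code would PQfinish() the connection first. */
static void wi_destroy(void *p)
{
    free(p);
}

/* Creates the key exactly once (return value ignored in this sketch). */
static void wi_make_key(void)
{
    (void) pthread_key_create(&wi_key, wi_destroy);
}

/* Returns the calling thread's worker_info, allocating it on first use. */
static worker_info *get_worker_info(void)
{
    worker_info *wi;

    pthread_once(&wi_once, wi_make_key);
    wi = pthread_getspecific(wi_key);
    if (wi == NULL) {
        wi = calloc(1, sizeof *wi);
        if (wi != NULL && pthread_setspecific(wi_key, wi) != 0) {
            free(wi);
            wi = NULL;
        }
    }
    return wi;
}
```

Each worker would call get_worker_info() before handling a message and connect if pg_conn is NULL or error is set.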

Each thread connects to the DB when it receives its first work message, or when the "error" flag is set on its connection. The error flag is set whenever running a database statement fails.

When the work is "slow" (~250 messages per second), I don't see any problem. As I increased the load, whenever I restart the process, the threads start grabbing work at a high rate, each one first opens a connection to the database, and these errors start popping up:

Can't connect to DB: could not send data to server: Transport endpoint is not connected
could not send startup packet: Transport endpoint is not connected

This is a result of executing the following code:

    wi->pg_conn = PQconnectdb(conn_str);
    ConnectionStatusType cst = PQstatus(wi->pg_conn);

    if (cst != CONNECTION_OK) {
        ERR("Can't connect to DB: %s\n", PQerrorMessage(wi->pg_conn));
    }

Eventually the errors go away (when a worker thread fails to connect, it just passes the message to another thread, waits for its turn, and then tries to reconnect), so it does seem that the remedy is just spreading the connections out in time.
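
One way to spread the connection attempts out in time would be a retry loop with exponential backoff and random jitter. The sketch below is only an illustration under assumptions: connect_fn, backoff_ms, connect_with_backoff and fail_n_times are hypothetical names, and the actual connect step (for example PQconnectdb() followed by a PQstatus() check) is hidden behind a callback so that the snippet does not depend on libpq.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <time.h>

/* Hypothetical callback: returns 0 on success, nonzero on failure. */
typedef int (*connect_fn)(void *ctx);

/* Delay before the retry that follows attempt number `attempt` (0-based):
 * base_ms doubled per attempt, capped at cap_ms, then a random value
 * picked from [delay/2, delay] so threads do not retry in lockstep. */
static long backoff_ms(int attempt, long base_ms, long cap_ms, unsigned *seed)
{
    long d = base_ms;
    long half;
    int i;

    for (i = 0; i < attempt && d < cap_ms; i++)
        d *= 2;
    if (d > cap_ms)
        d = cap_ms;
    half = d / 2;
    return half + (long) rand_r(seed) % (d - half + 1);
}

/* Calls try_connect up to max_attempts times, sleeping between failures.
 * Returns the number of attempts used, or -1 if all of them failed. */
static int connect_with_backoff(connect_fn try_connect, void *ctx,
                                int max_attempts, long base_ms, long cap_ms,
                                unsigned seed)
{
    int attempt;

    for (attempt = 0; attempt < max_attempts; attempt++) {
        struct timespec ts;
        long ms;

        if (try_connect(ctx) == 0)
            return attempt + 1;
        if (attempt + 1 == max_attempts)
            break;
        ms = backoff_ms(attempt, base_ms, cap_ms, &seed);
        ts.tv_sec = ms / 1000;
        ts.tv_nsec = (ms % 1000) * 1000000L;
        nanosleep(&ts, NULL);
    }
    return -1;
}

/* Demo stand-in for a real connect: fails while *ctx > 0, then succeeds. */
static int fail_n_times(void *ctx)
{
    int *left = ctx;

    if (*left > 0) {
        (*left)--;
        return 1;
    }
    return 0;
}
```

Seeding each thread differently (for instance from its thread index) keeps the jitter from being identical across workers.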

The connection string is '' (empty); the connection is made through /tmp/.s.PGSQL.5432

I don't see these errors when:
1) the number of worker threads is reduced (I could never reproduce it with 200 or fewer, but have seen it with 300 and more)

2) the load is reduced

-- problem 2 --

While trying to debug this with strace, I could never reproduce it, at least not well enough to see what's going on, but sometimes I get another error: "too many users connected". Even restarting postmaster doesn't help. The postmaster is running with -N810, and the role has a connection limit of 1000. Yet the "too many" error starts creeping up after only 275 connections are opened (counted by successful connect() calls in strace).

Any idea where I should dig?

P.S. I looked at fe-connect.c, and I'm wondering if there is a potential race condition between poll() and the socket actually finishing the connection. When running under strace, I never see EINPROGRESS returned from connect(), and the only way sendto() could result in ENOTCONN is if the connect didn't finish but the socket was deemed "connected" via poll/getsockopt...
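
To make the sequence in question concrete, here is a minimal sketch of the usual non-blocking connect pattern on a Unix-domain socket: connect(), then poll() for writability, then getsockopt(SO_ERROR). It is not taken from fe-connect.c and may differ from what libpq does; the function name connect_unix_nonblocking and its timeout parameter are assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Returns a connected fd, or -1 with errno set. */
static int connect_unix_nonblocking(const char *path, int timeout_ms)
{
    struct sockaddr_un addr;
    struct pollfd pfd;
    socklen_t len;
    int fd, flags, rc, soerr, saved;

    if (strlen(path) >= sizeof addr.sun_path) {
        errno = ENAMETOOLONG;
        return -1;
    }
    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);

    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        goto fail;

    if (connect(fd, (struct sockaddr *) &addr, sizeof addr) == 0)
        return fd;              /* completed immediately */

    /* Only EINPROGRESS (or EINTR) means "still connecting". Linux's
     * connect(2) also documents EAGAIN for non-blocking Unix-domain
     * sockets; this sketch treats that as a failure to be retried later,
     * not as a connection in progress. */
    if (errno != EINPROGRESS && errno != EINTR)
        goto fail;

    pfd.fd = fd;
    pfd.events = POLLOUT;
    pfd.revents = 0;
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);
    if (rc == 0)
        errno = ETIMEDOUT;
    if (rc <= 0)
        goto fail;

    /* Writability alone is not proof of success; ask for the result. */
    soerr = 0;
    len = sizeof soerr;
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &soerr, &len) < 0)
        goto fail;
    if (soerr != 0) {
        errno = soerr;
        goto fail;
    }
    return fd;

fail:
    saved = errno;
    close(fd);
    errno = saved;
    return -1;
}
```

If a sendto() on the returned descriptor still failed with ENOTCONN, that would suggest the "connected" decision was made on some path other than the SO_ERROR check shown here.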

Thanks,
  Pawel.