Re: errors with high connections rate

On 07/03/2012 03:19 PM, Pawel Veselov wrote:
Hi.

-- problem 1 --

I have an application, using libpq, connecting to postgres 9.1.3 (Amazon AMI distro).
The application writes data at a high rate (at this point it's 500 transactions per second), using multiple threads (at this point it's 800).

These are "worker" threads, that receive "messages" that are then written out to the DB. There is no connection pool, instead, each worker thread maintains it's own connection that it uses to write data to the database. The connections are kept pthread's "specific" data blocks.

Hmm. To get that kind of TPS with that design, are you running with fsync=off, or on storage that claims to flush I/O without actually doing so? Have you checked your crash safety? Or is it just fairly big hardware?

Why are you using so many connections? Unless you have truly monstrous hardware, your system should achieve considerably greater throughput by reducing the connection count and queueing bursts of writes. You wouldn't even need an external pool in your case; just switch to a producer/consumer model where your accepting threads hand work off to a separate, much smaller set of writer threads that send it to the DB. The writer threads could then do useful optimisations like multi-valued INSERTs or COPYing data, doing small batches in transactions, and so on.
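
Roughly what I have in mind, as a sketch only: the "messages" table, its "payload" column, the batch size and the connection string below are all invented placeholders, not anything from your system.

/*
 * Sketch only: the existing worker threads enqueue messages, and a small
 * pool of writer threads drains the queue, flushing each batch as one
 * multi-row INSERT inside a transaction.  Table/column names, batch size
 * and the connection string are placeholders.
 */
#include <libpq-fe.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BATCH_SIZE 100   /* rows flushed per transaction */
#define N_WRITERS  8     /* far fewer than 800 connections */

struct msg { char payload[256]; struct msg *next; };

static struct msg *queue_head, *queue_tail;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;

/* Producer side: worker threads hand messages off instead of writing
 * to the database themselves. */
void enqueue(const char *payload)
{
    struct msg *m = calloc(1, sizeof *m);
    snprintf(m->payload, sizeof m->payload, "%s", payload);
    pthread_mutex_lock(&queue_lock);
    if (queue_tail) queue_tail->next = m; else queue_head = m;
    queue_tail = m;
    pthread_cond_signal(&queue_cond);
    pthread_mutex_unlock(&queue_lock);
}

/* Send one batch as a single multi-row INSERT in its own transaction. */
static void flush_batch(PGconn *conn, struct msg *batch)
{
    char sql[BATCH_SIZE * 600 + 64];
    size_t off = snprintf(sql, sizeof sql,
                          "INSERT INTO messages (payload) VALUES ");
    for (struct msg *m = batch; m; m = m->next) {
        char *lit = PQescapeLiteral(conn, m->payload, strlen(m->payload));
        off += snprintf(sql + off, sizeof sql - off, "(%s)%s",
                        lit, m->next ? "," : "");
        PQfreemem(lit);
    }

    PQclear(PQexec(conn, "BEGIN"));
    PGresult *res = PQexec(conn, sql);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "insert failed: %s", PQerrorMessage(conn));
    PQclear(res);
    PQclear(PQexec(conn, "COMMIT"));

    while (batch) { struct msg *next = batch->next; free(batch); batch = next; }
}

/* Consumer side: each writer owns exactly one connection. */
static void *writer_main(void *arg)
{
    (void) arg;
    PGconn *conn = PQconnectdb("dbname=app");   /* placeholder DSN */
    if (PQstatus(conn) != CONNECTION_OK) return NULL;

    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (queue_head == NULL)
            pthread_cond_wait(&queue_cond, &queue_lock);

        /* Detach up to BATCH_SIZE messages while holding the lock. */
        struct msg *batch = queue_head, *cut = queue_head;
        int n = 1;
        while (cut->next && n < BATCH_SIZE) { cut = cut->next; n++; }
        queue_head = cut->next;
        if (queue_head == NULL) queue_tail = NULL;
        cut->next = NULL;
        pthread_mutex_unlock(&queue_lock);

        flush_batch(conn, batch);
    }
    PQfinish(conn);   /* not reached in this sketch */
    return NULL;
}

The writer threads would be started once at startup with pthread_create(..., writer_main, NULL), and your existing 800 worker threads would then just call enqueue() instead of touching libpq at all. 800 connections drops to 8, and each transaction carries up to 100 rows instead of one.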

I'm seriously impressed that your system is working under load at all with 800 concurrent connections fighting to write all at once.


Can't connect to DB: could not send data to server: Transport endpoint is not connected
could not send startup packet: Transport endpoint is not connected

Is the postmaster forking and then failing because of operating system resource limits (max process count, anti-forkbomb measures, max file handles, etc.)?

-- problem 2 --

As I'm trying to debug this (with strace), I could never reproduce it, at least not enough to see what's going on, but sometimes I get another error: "too many users connected". Even restarting the postmaster doesn't help. The postmaster is running with -N810, and the role has a connection limit of 1000. Yet the "too many" error starts creeping up after only 275 connections are opened (counted by successful connect() calls in strace).

Any idea where I should dig?

See how many connections the *server* thinks exist by examining pg_stat_activity.
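
For example, from psql (just one way to slice it):

SELECT count(*) FROM pg_stat_activity;
SELECT usename, count(*) FROM pg_stat_activity GROUP BY usename;

That tells you whether the server really has ~275 backends at that point and which role they belong to, so you can compare against both -N and the per-role connection limit.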

Check dmesg and the PostgreSQL server logs to see if you're hitting operating system limits. Look for fork() failures, unexplained segfaults, etc.

--
Craig Ringer

