Re: Scaling concerns

"Luke Lonergan" <llonergan@xxxxxxxxxxxxx> · Sat, 16 Dec 2006 23:18:25 -0800

Tsuraan,

"Select count(*) from bigtable" is testing your disk drive speed up till
about 300MB/s, after which it is CPU limited in Postgres.

My guess is that your system has a very slow I/O configuration, either due
to faulty driver/hardware or the configuration.

The first thing you should do is run a simple I/O test on your data
directory - write a file twice the size of memory using dd like this:

  time bash -c "dd if=/dev/zero of=data_directory/bigfile bs=8k count=(2 *
memory_size / 8192) && sync"

  time dd if=data_directory/bigfile of=/dev/null bs=8k

Then report the times here.

- Luke

On 12/16/06 9:26 AM, "tsuraan" <tsuraan@xxxxxxxxx> wrote:

> I'm writing a webmail-type application that is meant to be used in a
> corporate environment.  The core of my system is a Postgres database
> that is used as a message header cache.  The two (relevant) tables
> being used are pasted into the end of this message.  My problem is
> that, as the messages table increases to tens of millions of rows,
> pgsql slows down considerably.  Even an operation like "select
> count(*) from messages" can take minutes, with a totally idle system.
> Postgres seems to be the most scalable Free database out there, so I
> must be doing something wrong.
> 
> As for the most common strategy of having a slower (more rows)
> "archival" database and a smaller, faster "live" database, all the
> clients in the company are using their normal corporate email server
> for day-to-day email handling.  The webmail is used for access email
> that's no longer on the corporate server, so it's not really simple to
> say which emails should be considered live and which are really
> out-of-date.
> 
> My postgres settings are entirely default with the exception of
> shared_buffers being set to 40,000 and max_connections set to 400.
> I'm not sure what the meaning of most of the other settings are, so I
> haven't touched them.  The machines running the database servers are
> my home desktop (a dual-core athlon 3200+ with 2GB RAM and a 120GB
> SATA II drive), and a production server with two dual-core Intel
> chips, 4 GB RAM, and a RAID 5 array of SATA II drives on a 3Ware 9550
> controller.  Both machines are running Gentoo Linux with a 2.6.1x
> kernel, and both exhibit significant performance degradation when I
> start getting tens of millions of records.
> 
> Any advice would be most appreciated.  Thanks in advance!
> 
> Tables:
> 
> CREATE TABLE EmailAddresses (
>   emailid   SERIAL PRIMARY KEY,     --  The unique identifier of this address
>   name      TEXT NOT NULL,          --  The friendly name in the address
>   addrspec  TEXT NOT NULL,          --  The user@domain part of the address
>   UNIQUE(name, addrspec)
> );
> 
> and
> 
> CREATE TABLE Messages (
>   -- Store info:
>   msgkey    BIGSERIAL PRIMARY KEY,  --  Unique identifier for a message
>   path      TEXT NOT NULL,          --  Where the message is on the file
> system
>   inserted  TIMESTAMP DEFAULT now(),--  When the message was fetched
>   -- Message Info:
>   msgid     TEXT UNIQUE NOT NULL,   --  Message's Message-Id field
>   mfrom     INTEGER                 --  Who sent the message
>             REFERENCES EmailAddresses
>             DEFAULT NULL,
>   mdate     TIMESTAMP DEFAULT NULL, --  Message "date" header field
>   replyto   TEXT DEFAULT NULL,      --  Message-ID of replied-to message
>   subject   TEXT DEFAULT NULL,      --  Message "subject" header field
>   numatch   INTEGER DEFAULT NULL,   --  Number of attachments
>   UNIQUE(path)
> );
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
> 
>                http://www.postgresql.org/docs/faq
>