Large DB

"Mooney, Ryan" <ryan.mooney@pnl.gov> · Tue, 30 Mar 2004 17:48:14 -0800

Hello, 

I have a single table that just went over 234GB in size with about 290M+
rows.  I think that I'm starting to approach some limits since things
have gotten quite a bit slower over the last couple days.  The table is
really simple and I'm mostly doing simple data mining queries like the
query included below.  These have gone from taking a under a few minutes
to taking tens of minutes over the last week or so (a query like the
below would generally return 1-6 million rows from the base table).  The
queries use the indexes fairly well, although I suspect that the order
of host/starttime is suboptimal (fewer hosts than starttime, and the
table is naturally in starttime order).  I'm going to try adding an
index on just starttime (and later just host) and see if I can tune the
queries on that more.  I never delete rows from the table, only do
inserts (up to around 11,000/minute mostly in one big burst every
minute, this is anticipated to go up some over time).  There are about
32 processes doing the inserts (on the same machine - yeah I know it'd
be happier if they moved); I think it might help if there was only one,
but for architectural reasons that won't happen for a while.

This is on a dual 3Ghz xenon with 4G Ram and an IDE-SCSI raid array
(ACNC) I'm running RH Fedora with kernel 2.4.22-1.2115.nptlsmp (we'd
tried FBSD 4/5 early on, but the insert speeds were actually better with
RH9 by a ~10% or so - this was pre fbsd 5.2, but it's a bit late to
migrate easily now).

I'm trying to figure out ways to squeak another ounce or two of
performance out of this machine, I've included the things I've tuned so
far below.

The query processes are mostly stuck in D state so I expect that I'm
hitting some hw limitations, but I'm only doing sub 15MB from the disk
array (from iostat) and I know it can do in the 40-60MB range when we
tested the raw speed, and only 400 or so tps which is also well under
the arrays limits so I suspect that its thrashing a bit, this is also
indicated by the contrast between rrqm/s (read requests merged per
second) which is pushing 2000 and the actual r/s (read requests that
were issued to the device) at around 400 or so (same as tps).  I suspect
that a lot of the time is spent reading indexes, so a "better" indexing
scheme may be my best bet.

Estimating the table size
-------------------------

stats=> select relfilenode,relname from pg_class where
relfilenode=37057796;
 relfilenode | relname
-------------+---------
    37057796 | tp3

du -sc 37057796* | grep total
234002372       total

However the two indexes are also - large (which may be part of the
problem, which is why I'm trying just starttime for an index; They are
currently in the 140-150G range).

The query optimizer thinks I have ~ 290M rows (I'm not actually doing a
real select count since the last time I did that was around 10M rows or
so and it took a long while, I don't want to wait days :).
------------------------------
stats=> explain select count(*) from tp3;
                              QUERY PLAN
-----------------------------------------------------------------------
 Aggregate  (cost=7632998.20..7632998.20 rows=1 width=0)
   ->  Seq Scan on tp3  (cost=0.00..6906493.16 rows=290602016 width=0)
(2 rows)

Table def
----------------------------
stats=> \d tp3
                  Table "public.tp3"
   Column    |            Type             | Modifiers
-------------+-----------------------------+-----------
 host        | character(4)                |
 point       | character varying(64)       |
 type        | character(1)                |
 cooked      | character(1)                |
 starttime   | timestamp without time zone |
 intervallen | interval                    |
 arrivetime  | timestamp without time zone |
 pvalue      | numeric                     |
Indexes:
    "tp3_host_starttime" btree (host, starttime, cooked)
    "tp3_point_starttime" btree (point, starttime, cooked)

Sample data mining query:
----------------------------
select point, avg(pvalue) as avg from tp3 where host in ('node', 'node',
....) and starttime between 'timestamp' and 'timestamp' group by point

Tuning done so far:
----------------------------
$ cat /etc/sysctl.conf

kernel.shmall=805306368
kernel.shmmax=805306368

$ egrep -v "^#|^$" postgresql.conf

shared_buffers = 60800
sort_mem = 1286720              # min 64, size in KB
vacuum_mem = 102400             # min 1024, size in KB
fsync=false   # Play fast and loose - whee
max_files_per_process = 1000
wal_buffers = 16
checkpoint_segments = 20        
checkpoint_timeout = 100
effective_cache_size = 160000

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
      subscribe-nomail command to majordomo@postgresql.org so that your
      message can get through to the mailing list cleanly