Based on the queries you run, is it possible to split up the schema into different tables? Are the differences between timestamps in the sample query usually small? We had a similar problem, although with a slightly smaller data set -- but one that was going to keep growing.
Our questions were: how could we scale? What about vacuuming our tables, running ANALYZE in a decent time? Backing up? And so on. We found that most of the queries we wanted were in the domain of a day, so we actually split up that giant table and made one table for each day. We could have done one for each week as well, but the daily tables worked well for us. Sure, it's a bit more work getting data over a long time period, but those common queries were a cinch.
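The daily-table layout described above can be sketched roughly like this (table, column, and index names are illustrative, loosely borrowed from the tp3 schema quoted below; this is a sketch of the idea, not our exact DDL):

```sql
-- Each day gets its own table inheriting from a common parent.
CREATE TABLE tp_parent (
    host      character(4),
    point     character varying(64),
    starttime timestamp without time zone,
    pvalue    numeric
);

CREATE TABLE tp_20040926 () INHERITS (tp_parent);
CREATE INDEX tp_20040926_host_start ON tp_20040926 (host, starttime);

-- Day-bounded queries touch only one small table and its small index:
SELECT point, avg(pvalue)
FROM tp_20040926
WHERE starttime BETWEEN '2004-09-26 00:00' AND '2004-09-26 12:00'
GROUP BY point;

-- Queries spanning many days can go through the parent,
-- which scans all the child tables:
SELECT count(*) FROM tp_parent;
```

The win is that each day's table and indexes stay small enough to scan, ANALYZE, and back up quickly, at the cost of some bookkeeping for cross-day queries.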
We've also seen, in cases where we have to dump in thousands of records every few minutes, that select queries respond remarkably faster when frequent (one or two every hour) ANALYZEs are done, even on those daily tables which contain just a few million records each.
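As a concrete example, that hourly ANALYZE can just be a cron job (the database and table names here are made up):

```shell
# Hypothetical crontab entry: re-analyze the current day's table once an
# hour so the planner's statistics keep up with the constant inserts.
0 * * * *  psql -d stats -c "ANALYZE tp_20040926;"
```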
Tweaking the hardware, IMHO, would probably take you just a little further, but you've got to think about what your response times will be a month from now, given your current growth. Can your schema stand it then?
- Ericson Smith
Mooney, Ryan wrote:
Hello,
I have a single table that just went over 234GB in size, with about 290M+ rows. I think that I'm starting to approach some limits, since things have gotten quite a bit slower over the last couple of days. The table is really simple and I'm mostly doing simple data mining queries like the one included below. These have gone from taking under a few minutes to taking tens of minutes over the last week or so (a query like the one below would generally return 1-6 million rows from the base table).

The queries use the indexes fairly well, although I suspect that the order of host/starttime is suboptimal (there are fewer hosts than starttimes, and the table is naturally in starttime order). I'm going to try adding an index on just starttime (and later just host) and see if I can tune the queries on that more.

I never delete rows from the table, only do inserts (up to around 11,000/minute, mostly in one big burst every minute; this is anticipated to go up some over time). There are about 32 processes doing the inserts (on the same machine - yeah, I know it'd be happier if they moved). I think it might help if there were only one, but for architectural reasons that won't happen for a while.
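(For reference, the starttime-only index proposed above would be created like this; the index name is illustrative. Note that CREATE INDEX takes a lock that blocks concurrent inserts, so on a 234GB table it would stall those 11,000/minute insert bursts for the duration of the build:)

```sql
-- Hypothetical single-column index to test the theory that leading
-- with starttime matches the natural order of the table better.
CREATE INDEX tp3_starttime ON tp3 (starttime);
```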
This is on a dual 3GHz Xeon with 4G RAM and an IDE-SCSI RAID array (ACNC). I'm running RH Fedora with kernel 2.4.22-1.2115.nptlsmp (we'd tried FBSD 4/5 early on, but the insert speeds were actually better with RH9 by ~10% or so - this was pre FBSD 5.2, but it's a bit late to migrate easily now).
I'm trying to figure out ways to squeeze another ounce or two of performance out of this machine; I've included the things I've tuned so far below.
The query processes are mostly stuck in D state, so I expect that I'm hitting some hardware limitations. But I'm only doing sub-15MB/s from the disk array (from iostat), and I know it can do in the 40-60MB/s range from when we tested the raw speed, and only 400 or so tps, which is also well under the array's limits. So I suspect that it's thrashing a bit; this is also indicated by the contrast between rrqm/s (read requests merged per second), which is pushing 2000, and the actual r/s (read requests issued to the device) at around 400 or so (same as tps). I suspect that a lot of the time is spent reading indexes, so a "better" indexing scheme may be my best bet.
Estimating the table size
-------------------------
stats=> select relfilenode,relname from pg_class where relfilenode=37057796;
 relfilenode | relname
-------------+---------
    37057796 | tp3

$ du -sc 37057796* | grep total
234002372       total
However, the two indexes are also large (currently in the 140-150G range), which may be part of the problem, and is why I'm trying just starttime for an index.
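(A quick way to confirm those index sizes from inside the database, rather than with du, is the planner's page counts; relpages is only as fresh as the last VACUUM/ANALYZE:)

```sql
-- relpages counts 8kB pages, so relpages * 8 / 1024 gives an
-- approximate on-disk size in MB for each index.
SELECT relname, relpages, relpages * 8 / 1024 AS approx_mb
FROM pg_class
WHERE relname IN ('tp3_host_starttime', 'tp3_point_starttime');
```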
The query optimizer thinks I have ~290M rows. (I'm not actually doing a real select count, since the last time I did that was around 10M rows and it took a long while; I don't want to wait days :)
------------------------------
stats=> explain select count(*) from tp3;
                               QUERY PLAN
-----------------------------------------------------------------------
 Aggregate  (cost=7632998.20..7632998.20 rows=1 width=0)
   ->  Seq Scan on tp3  (cost=0.00..6906493.16 rows=290602016 width=0)
(2 rows)
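(That same planner estimate can also be read directly from the catalogs, which is essentially free compared to a real count(*) over 290M rows:)

```sql
-- reltuples is the row estimate the planner shows above; it is
-- refreshed by VACUUM/ANALYZE rather than computed on demand.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'tp3';
```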
Table def
----------------------------
stats=> \d tp3
                 Table "public.tp3"
   Column    |            Type             | Modifiers
-------------+-----------------------------+-----------
 host        | character(4)                |
 point       | character varying(64)       |
 type        | character(1)                |
 cooked      | character(1)                |
 starttime   | timestamp without time zone |
 intervallen | interval                    |
 arrivetime  | timestamp without time zone |
 pvalue      | numeric                     |
Indexes:
    "tp3_host_starttime" btree (host, starttime, cooked)
    "tp3_point_starttime" btree (point, starttime, cooked)
Sample data mining query:
----------------------------
select point, avg(pvalue) as avg
from tp3
where host in ('node', 'node', ....)
  and starttime between 'timestamp' and 'timestamp'
group by point
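(To see which index a query like this actually uses, EXPLAIN it with concrete values filled in; the host names and timestamps below are placeholders, not real values from the data set:)

```sql
EXPLAIN
SELECT point, avg(pvalue) AS avg
FROM tp3
WHERE host IN ('n001', 'n002')                         -- placeholder hosts
  AND starttime BETWEEN '2004-09-25' AND '2004-09-26'  -- placeholder range
GROUP BY point;
```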
Tuning done so far:
----------------------------
$ cat /etc/sysctl.conf
kernel.shmall=805306368
kernel.shmmax=805306368
$ egrep -v "^#|^$" postgresql.conf
shared_buffers = 60800
sort_mem = 1286720 # min 64, size in KB
vacuum_mem = 102400 # min 1024, size in KB
fsync=false # Play fast and loose - whee
max_files_per_process = 1000
wal_buffers = 16
checkpoint_segments = 20
checkpoint_timeout = 100
effective_cache_size = 160000