Based on the queries you run, is it possible to split up the schema into different tables? Are the differences between timestamps in the sample query usually small? We had a similar problem, although with a slightly smaller data set -- but one that was going to keep growing.
Our questions were: how could we scale? What about vacuuming our tables, running ANALYZE in a decent time? Backing up? And so on. We found that most of the queries we wanted were in the domain of a day, so we actually split up that giant table and made one table for each day. We could have done one for each week as well, but the daily tables worked well for us. Sure, it's a bit more work getting data over a long time period, but those common queries were a cinch.
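The daily-table layout described above can be sketched roughly like this (table, column, and index names are illustrative, loosely borrowed from the tp3 schema quoted below; this is a sketch of the idea, not our exact DDL):

```sql
-- Each day gets its own table inheriting from a common parent.
CREATE TABLE tp_parent (
    host      character(4),
    point     character varying(64),
    starttime timestamp without time zone,
    pvalue    numeric
);

CREATE TABLE tp_20040926 () INHERITS (tp_parent);
CREATE INDEX tp_20040926_host_start ON tp_20040926 (host, starttime);

-- Day-bounded queries touch only one small table and its small index:
SELECT point, avg(pvalue)
FROM tp_20040926
WHERE starttime BETWEEN '2004-09-26 00:00' AND '2004-09-26 12:00'
GROUP BY point;

-- Queries spanning many days can go through the parent,
-- which scans all the child tables:
SELECT count(*) FROM tp_parent;
```

The win is that each day's table and indexes stay small enough to scan, ANALYZE, and back up quickly, at the cost of some bookkeeping for cross-day queries.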
We've also seen, in cases where we have to dump in thousands of records every few minutes, that select queries respond remarkably faster when frequent (one or two every hour) ANALYZEs are done, even on those daily tables which contain just a few million records each.
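As a concrete example, that hourly ANALYZE can just be a cron job (the database and table names here are made up):

```shell
# Hypothetical crontab entry: re-analyze the current day's table once an
# hour so the planner's statistics keep up with the constant inserts.
0 * * * *  psql -d stats -c "ANALYZE tp_20040926;"
```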
Tweaking the hardware, IMHO, would probably take you just a little further, but you've got to think about what your response times will be a month from now, given your current growth. Can your schema stand it then?
- Ericson Smith
Mooney, Ryan wrote:
Hello,
I have a single table that just went over 234GB in size, with about 290M+ rows. I think that I'm starting to approach some limits, since things have gotten quite a bit slower over the last couple of days. The table is really simple and I'm mostly doing simple data mining queries like the one included below. These have gone from taking under a few minutes to taking tens of minutes over the last week or so (a query like the one below would generally return 1-6 million rows from the base table).

The queries use the indexes fairly well, although I suspect that the order of host/starttime is suboptimal (there are fewer hosts than starttimes, and the table is naturally in starttime order). I'm going to try adding an index on just starttime (and later just host) and see if I can tune the queries on that more.

I never delete rows from the table, only do inserts (up to around 11,000/minute, mostly in one big burst every minute; this is anticipated to go up some over time). There are about 32 processes doing the inserts (on the same machine - yeah, I know it'd be happier if they moved). I think it might help if there were only one, but for architectural reasons that won't happen for a while.
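(For reference, the starttime-only index proposed above would be created like this; the index name is illustrative. Note that CREATE INDEX takes a lock that blocks concurrent inserts, so on a 234GB table it would stall those 11,000/minute insert bursts for the duration of the build:)

```sql
-- Hypothetical single-column index to test the theory that leading
-- with starttime matches the natural order of the table better.
CREATE INDEX tp3_starttime ON tp3 (starttime);
```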
This is on a dual 3GHz Xeon with 4G RAM and an IDE-SCSI RAID array (ACNC). I'm running RH Fedora with kernel 2.4.22-1.2115.nptlsmp (we'd tried FBSD 4/5 early on, but the insert speeds were actually better with RH9 by ~10% or so - this was pre FBSD 5.2, but it's a bit late to migrate easily now).
I'm trying to figure out ways to squeeze another ounce or two of performance out of this machine; I've included the things I've tuned so far below.
The query processes are mostly stuck in D state, so I expect that I'm hitting some hardware limitations. But I'm only doing sub-15MB/s from the disk array (from iostat), and I know it can do in the 40-60MB/s range from when we tested the raw speed, and only 400 or so tps, which is also well under the array's limits. So I suspect that it's thrashing a bit; this is also indicated by the contrast between rrqm/s (read requests merged per second), which is pushing 2000, and the actual r/s (read requests issued to the device) at around 400 or so (same as tps). I suspect that a lot of the time is spent reading indexes, so a "better" indexing scheme may be my best bet.
Estimating the table size
-------------------------
stats=> select relfilenode,relname from pg_class where relfilenode=37057796;
 relfilenode | relname
-------------+---------
    37057796 | tp3

$ du -sc 37057796* | grep total
234002372       total
However, the two indexes are also large (currently in the 140-150G range), which may be part of the problem, and is why I'm trying just starttime for an index.
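(A quick way to confirm those index sizes from inside the database, rather than with du, is the planner's page counts; relpages is only as fresh as the last VACUUM/ANALYZE:)

```sql
-- relpages counts 8kB pages, so relpages * 8 / 1024 gives an
-- approximate on-disk size in MB for each index.
SELECT relname, relpages, relpages * 8 / 1024 AS approx_mb
FROM pg_class
WHERE relname IN ('tp3_host_starttime', 'tp3_point_starttime');
```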
The query optimizer thinks I have ~290M rows. (I'm not actually doing a real select count, since the last time I did that was around 10M rows and it took a long while; I don't want to wait days :)
------------------------------
stats=> explain select count(*) from tp3;
                               QUERY PLAN
-----------------------------------------------------------------------
 Aggregate  (cost=7632998.20..7632998.20 rows=1 width=0)
   ->  Seq Scan on tp3  (cost=0.00..6906493.16 rows=290602016 width=0)
(2 rows)
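(That same planner estimate can also be read directly from the catalogs, which is essentially free compared to a real count(*) over 290M rows:)

```sql
-- reltuples is the row estimate the planner shows above; it is
-- refreshed by VACUUM/ANALYZE rather than computed on demand.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'tp3';
```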
Table def
----------------------------
stats=> \d tp3
                 Table "public.tp3"
   Column    |            Type             | Modifiers
-------------+-----------------------------+-----------
 host        | character(4)                |
 point       | character varying(64)       |
 type        | character(1)                |
 cooked      | character(1)                |
 starttime   | timestamp without time zone |
 intervallen | interval                    |
 arrivetime  | timestamp without time zone |
 pvalue      | numeric                     |
Indexes:
    "tp3_host_starttime" btree (host, starttime, cooked)
    "tp3_point_starttime" btree (point, starttime, cooked)
Sample data mining query:
----------------------------
select point, avg(pvalue) as avg
from tp3
where host in ('node', 'node', ....)
  and starttime between 'timestamp' and 'timestamp'
group by point
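(To see which index a query like this actually uses, EXPLAIN it with concrete values filled in; the host names and timestamps below are placeholders, not real values from the data set:)

```sql
EXPLAIN
SELECT point, avg(pvalue) AS avg
FROM tp3
WHERE host IN ('n001', 'n002')                         -- placeholder hosts
  AND starttime BETWEEN '2004-09-25' AND '2004-09-26'  -- placeholder range
GROUP BY point;
```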
Tuning done so far:
----------------------------
$ cat /etc/sysctl.conf
kernel.shmall=805306368
kernel.shmmax=805306368
$ egrep -v "^#|^$" postgresql.conf
shared_buffers = 60800
sort_mem = 1286720 # min 64, size in KB
vacuum_mem = 102400 # min 1024, size in KB
fsync=false # Play fast and loose - whee
max_files_per_process = 1000
wal_buffers = 16
checkpoint_segments = 20
checkpoint_timeout = 100
effective_cache_size = 160000