On Fri, 2006-01-27 at 20:23 -0500, Mike Biamonte wrote:

> Does anyone have any experience with extremely large data sets?
> I mean hundreds of millions of rows.

Sure, I think more than a few of us do.  Just today I built a summary
table from a 25GB primary table with ~430 million rows.  This took
about 45 minutes.

> The queries I need to run on my 200 million transactions are
> relatively simple:
>
>   select month, count(distinct(cardnum)), count(*), sum(amount)
>     from transactions group by month;
>
> This query took 18 hours on PG 8.1 on a Dual Xeon, RHEL3 (2.4
> kernel) with RAID-10 (15K drives) and 12 GB RAM.  I was expecting it
> to take about 4 hours - based on some experience with a similar
> dataset on a different machine (RH9, PG 7.3, Dual Xeon, 4 GB RAM,
> RAID-5 10K drives).

Possibly the latter machine has a faster I/O subsystem.  How large is
the table on disk?

> This machine is COMPLETELY devoted to running these relatively
> simple queries one at a time.  (No multi-user support needed!)
> I've been tooling with the various performance settings:
> effective_cache_size at 5 GB, shared_buffers at 2 GB, work_mem and
> sort_mem at 1 GB each.
> (Shared buffers puzzles me a bit - my instinct says to set it as
> high as possible, but everything I read says that "too high" can
> hurt performance.)
>
> Any ideas for performance tweaking in this kind of application would
> be greatly appreciated.  We've got indexes on the fields being
> grouped, and always vacuum analyze after building them.

The vacuum probably makes no difference here.

> It's difficult to just "try" various ideas because each attempt
> takes a full day to test.  Real experience is needed here!

Can you send us an EXPLAIN of the query?  I believe what you're seeing
here is probably:

  Aggregate
  +- Sort
     +- Sequential Scan

or perhaps:

  Aggregate
  +- Index Scan

I have a feeling that the latter will be much faster.  If your table
has been created over time, then it is probably naturally ordered by
date, and therefore also ordered by month.  You might expect a
Sequential Scan to be the fastest, but the Sort step will be a killer.
On the other hand, if your table is badly disordered by date, the
Index Scan could also be very slow.

Anyway, send us the query plan and also perhaps a sample of vmstat
during the query.  (There are a few SQL sketches below my sig: one for
capturing the plan, one for the per-query work_mem bump, and one
showing the kind of summary table I mean.)

For what it's worth, I have:

  effective_cache_size | 700000
  cpu_tuple_cost       | 0.01
  cpu_index_tuple_cost | 0.001
  random_page_cost     | 3
  shared_buffers       | 50000
  temp_buffers         | 1000
  work_mem             | 1048576   <= for this query only

And here's a few lines from vmstat during the query:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff   cache   si   so    bi    bo    in    cs us sy id wa
 2  1     76  43476  94916 7655148    0    0 78800     0  1662   788 68 12  0 20
 1  1     76  45060  91196 7658088    0    0 78028     0  1639   712 71 11  0 19
 2  0     76  44668  87624 7662960    0    0 78924    52  1650   736 69 12  0 19
 2  0     76  45300  83672 7667432    0    0 83536    16  1688   768 71 12  0 18
 1  1     76  45744  80652 7670712    0    0 84052     0  1691   796 70 12  0 17

That's about 80 MB/sec of sequential input, for comparison purposes.

-jwb
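
P.S.  The sketches, as promised.  First, capturing the plan.  This
assumes your table and columns really are named transactions, month,
cardnum, and amount, as in your quoted query.  Plain EXPLAIN only asks
the planner and returns immediately; EXPLAIN ANALYZE actually runs the
query, so only use it if you can spare another multi-hour run.

  -- Planner's choice only; does not execute the query.
  EXPLAIN
  SELECT month, count(DISTINCT cardnum), count(*), sum(amount)
    FROM transactions
   GROUP BY month;

  -- Runs the query and reports actual row counts and timings.
  EXPLAIN ANALYZE
  SELECT month, count(DISTINCT cardnum), count(*), sum(amount)
    FROM transactions
   GROUP BY month;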
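
Second, the "for this query only" work_mem setting above is just a
per-session SET, so the value in postgresql.conf stays modest.  In 8.1
work_mem is measured in kilobytes, so 1048576 is roughly 1 GB - pick
whatever your RAM actually allows.

  -- Raise work_mem for this session only; RESET (or disconnecting)
  -- puts it back to the postgresql.conf value.
  SET work_mem = 1048576;   -- kilobytes, i.e. about 1 GB

  SELECT month, count(DISTINCT cardnum), count(*), sum(amount)
    FROM transactions
   GROUP BY month;

  RESET work_mem;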
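
Finally, the summary table I mentioned is nothing fancier than
aggregating once into a small table and querying that from then on.
The names below are made up for illustration, not your schema or my
exact statement:

  -- One expensive pass over the big table, then cheap queries after.
  CREATE TABLE transactions_monthly AS
  SELECT month,
         count(DISTINCT cardnum) AS cards,
         count(*)                AS txns,
         sum(amount)             AS total_amount
    FROM transactions
   GROUP BY month;

  -- Let the planner know about the new table.
  ANALYZE transactions_monthly;

You would have to rebuild (or incrementally update) it as new
transactions arrive, of course.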