On Tue, Feb 9, 2010 at 11:51 PM, Greg Smith <greg@xxxxxxxxxxxxxxx> wrote:
> Asher wrote:
>>
>> Once loaded into the database the data will never be deleted or modified
>> and will typically be accessed over a particular date range for a
>> particular channel (e.g. "sample_time >= X AND sample_time <= Y AND
>> channel=Z"). A typical query won't return more than a few million rows
>> and speed is not desperately important (as long as the time is measured
>> in minutes rather than hours).
>>
>> Are there any recommended ways to organise this? Should I partition my
>> big table into multiple smaller ones which will always fit in memory
>> (this would result in several hundreds or thousands of sub-tables)? Are
>> there any ways to keep the index size to a minimum? At the moment I have
>> a few weeks of data, about 180GB, loaded into a single table and indexed
>> on sample_time and channel, and the index takes up 180GB too.
>
> One approach to consider is partitioning by sample_time and not even
> including the channel number in the index. You've got tiny records;
> there are going to be hundreds of channels of data on each data page
> pulled in, right? Why not minimize physical I/O by reducing the index
> size, just read that whole section of time into memory (they should be
> pretty closely clustered and therefore mostly sequential I/O), and then
> push the filtering by channel onto the CPU instead. If you've got
> billions of rows, you're going to end up disk bound anyway; minimizing
> physical I/O and random seeking around at the expense of CPU time could
> be a big win.

If they're inserted in a one-time load, load them in channel order into
the partitions, and the planner statistics should see the near-perfect
ordering and know to seq scan the right part of the table. I think.
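
To make Greg's suggestion concrete, a rough sketch using 8.x-style
inheritance partitioning. Untested, and the table name, column types and
the monthly granularity are just placeholders for whatever the real
schema looks like:

  CREATE TABLE samples (
      sample_time  timestamptz NOT NULL,
      channel      integer     NOT NULL,
      value        double precision
  );

  -- One child table per month, with a narrow index on sample_time only;
  -- filtering by channel is pushed onto the CPU after the pages are read.
  CREATE TABLE samples_2010_02 (
      CHECK (sample_time >= '2010-02-01' AND sample_time < '2010-03-01')
  ) INHERITS (samples);

  CREATE INDEX samples_2010_02_time_idx ON samples_2010_02 (sample_time);

  -- With constraint exclusion enabled the planner skips the other months.
  SET constraint_exclusion = on;

  SELECT * FROM samples
  WHERE sample_time >= '2010-02-05'
    AND sample_time <  '2010-02-06'
    AND channel = 42;

The per-partition index on sample_time alone stays far smaller than a
composite (sample_time, channel) index, which is the whole point of the
suggestion.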
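
And for the one-time load in channel order, something along these lines;
the staging table name is made up, and whether the planner actually takes
advantage of the correlation is exactly the part I'm unsure about:

  -- Load each partition in channel order so rows for the same channel
  -- end up physically adjacent on disk.
  INSERT INTO samples_2010_02
  SELECT sample_time, channel, value
  FROM staging_feb
  ORDER BY channel, sample_time;

  ANALYZE samples_2010_02;

  -- pg_stats.correlation shows how closely the physical row order tracks
  -- each column; channel should come out at or near 1.0 after such a load.
  SELECT attname, correlation
  FROM pg_stats
  WHERE tablename = 'samples_2010_02';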