* Yeb Havinga (yebhavinga@xxxxxxxxx) wrote:
>> Normalizing by date parts was fast. Partitioning the tables by year
>> won't do much good -- users will probably choose 1900 to 2009,
>> predominately.
> Ok, in that case it is a bad idea.

Yeah, now that I understand what the user actually wants, I can certainly
understand why you wouldn't want to partition by year.  It does strike me
that perhaps you could partition by day ranges, but you'd have to store
them as something other than the 'date' type, which is certainly
frustrating -- though you're not really operating on these in a 'normal'
fashion as you would with a date anyway.

The next question I would have, however, is whether you could
pre-aggregate some of this data.  If users are typically going to ask for
1900-2009, could the information for all of those years be aggregated
a priori to make those queries faster?

>> I thought about splitting the data by station by category, but that's
>> ~73000 tables.

Don't get hung up on needing a separate table for every unique value in
the column -- you don't need that.  constraint_exclusion works just fine
with ranges too; the problem is that the ranges need to make sense for
the data type you're using and for the queries you're running.  That
doesn't really work here with measurement_date, but it might work just
fine with your station_id field.

>> I also thought about splitting the data by station district by
>> category -- there are 79 districts, yielding 474 child tables, which
>> is ~575000 rows per child table. Most of the time I'd imagine only one
>> or two districts would be selected. (Again, hard to know exactly.)

Also realize that PG will use multiple files for a single table once the
size of that table goes beyond 1G.

> I agree with Matthew Wakeling in a different post: it's probably wise to
> first see how fast things can get by using indexes.  Only if that fails
> to be fast, partitioning might be an option.  (Though sequentially
> scanning 0.5M rows is not cheap.)

I would agree with this too -- get it working first, then look at
partitioning.  Even more so, work on a smaller data set to begin with
while you're figuring out how to get the right answer in a generally
efficient way (not doing seq. scans through everything because you're
operating on every row for something).  It needs to be a couple
hundred-thousand rows, but it doesn't need to be the full data set, imv.

	Thanks,

		Stephen
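
P.S.  To make the pre-aggregation idea a bit more concrete, here's a
rough sketch.  I'm guessing at your table and column names (measurement,
station_id, category, measurement_date, value), so adjust to whatever
your actual schema looks like:

-- Hypothetical summary table, rolled up per station and category across
-- all years, since 1900-2009 seems to be the common request.
CREATE TABLE measurement_summary AS
SELECT station_id,
       category,
       count(*)   AS n,
       min(value) AS min_value,
       max(value) AS max_value,
       avg(value) AS avg_value
  FROM measurement
 GROUP BY station_id, category;

CREATE INDEX measurement_summary_station_cat_idx
    ON measurement_summary (station_id, category);

Queries for the "all years" case could then hit the summary table instead
of the half-million-row detail set.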
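
And here's roughly what I mean by ranges on station_id: inheritance-based
child tables with CHECK constraints covering ranges of station_ids,
rather than one table per station.  The ranges and names below are made
up, of course:

-- Parent table holds no rows; each child covers a range of station_ids.
CREATE TABLE measurement (
    station_id        integer NOT NULL,
    category          varchar NOT NULL,
    measurement_date  date    NOT NULL,
    value             numeric
);

CREATE TABLE measurement_s0 (
    CHECK (station_id >= 0 AND station_id < 10000)
) INHERITS (measurement);

CREATE TABLE measurement_s1 (
    CHECK (station_id >= 10000 AND station_id < 20000)
) INHERITS (measurement);

-- ...more children as needed to cover the full station_id range.

-- With constraint_exclusion enabled, the planner skips any child whose
-- CHECK constraint can't possibly satisfy the WHERE clause:
SET constraint_exclusion = on;

SELECT avg(value)
  FROM measurement
 WHERE station_id BETWEEN 12000 AND 12500
   AND measurement_date BETWEEN '1900-01-01' AND '2009-12-31';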
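
As for trying indexes first, a multicolumn index matching the typical
restriction would be the place to start (again, just guessing at the
columns), and then check with EXPLAIN ANALYZE that the planner actually
uses it:

CREATE INDEX measurement_station_cat_date_idx
    ON measurement (station_id, category, measurement_date);

EXPLAIN ANALYZE
SELECT avg(value)
  FROM measurement
 WHERE station_id = 12345
   AND category = 'temperature'
   AND measurement_date BETWEEN '1900-01-01' AND '2009-12-31';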