* Yeb Havinga (yebhavinga@xxxxxxxxx) wrote:
>> Normalizing by date parts was fast. Partitioning the tables by year
>> won't do much good -- users will probably choose 1900 to 2009,
>> predominately.
> Ok, in that case it is a bad idea.

Yeah, now that I understand what the user actually wants, I can certainly
understand why you wouldn't want to partition by year.  It does strike me
that perhaps you could partition by day ranges, but you'd have to store
them as something other than the 'date' type, which is certainly
frustrating -- though you're not really operating on these in a 'normal'
fashion as you would with a date anyway.

The next question I would have, however, is whether you could
pre-aggregate some of this data.  If users are typically going to ask for
1900-2009, could the information for all of those years be aggregated
a priori to make those queries faster?

>> I thought about splitting the data by station by category, but that's
>> ~73000 tables.

Don't get hung up on needing a separate table for every unique value in
the column -- you don't need that.  constraint_exclusion works just fine
with ranges too; the problem is that the ranges need to make sense for
the data type you're using and for the queries you're running.  That
doesn't really work here with measurement_date, but it might work just
fine with your station_id field.

>> I also thought about splitting the data by station district by
>> category -- there are 79 districts, yielding 474 child tables, which
>> is ~575000 rows per child table. Most of the time I'd imagine only one
>> or two districts would be selected. (Again, hard to know exactly.)

Also realize that PG will use multiple files for a single table once the
size of that table goes beyond 1G.

> I agree with Matthew Wakeling in a different post: it's probably wise to
> first see how fast things can get by using indexes.  Only if that fails
> to be fast, partitioning might be an option.  (Though sequentially
> scanning 0.5M rows is not cheap.)

I would agree with this too -- get it working first, then look at
partitioning.  Even more so, work on a smaller data set to begin with
while you're figuring out how to get the right answer in a generally
efficient way (not doing seq. scans through everything because you're
operating on every row for something).  It needs to be a couple
hundred-thousand rows, but it doesn't need to be the full data set, imv.

	Thanks,

		Stephen
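
P.S.  To make the pre-aggregation idea a bit more concrete, here's a
rough sketch.  I'm guessing at your table and column names (measurement,
station_id, category, measurement_date, value), so adjust to whatever
your actual schema looks like:

-- Hypothetical summary table, rolled up per station and category across
-- all years, since 1900-2009 seems to be the common request.
CREATE TABLE measurement_summary AS
SELECT station_id,
       category,
       count(*)   AS n,
       min(value) AS min_value,
       max(value) AS max_value,
       avg(value) AS avg_value
  FROM measurement
 GROUP BY station_id, category;

CREATE INDEX measurement_summary_station_cat_idx
    ON measurement_summary (station_id, category);

Queries for the "all years" case could then hit the summary table instead
of the half-million-row detail set.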
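
And here's roughly what I mean by ranges on station_id: inheritance-based
child tables with CHECK constraints covering ranges of station_ids,
rather than one table per station.  The ranges and names below are made
up, of course:

-- Parent table holds no rows; each child covers a range of station_ids.
CREATE TABLE measurement (
    station_id        integer NOT NULL,
    category          varchar NOT NULL,
    measurement_date  date    NOT NULL,
    value             numeric
);

CREATE TABLE measurement_s0 (
    CHECK (station_id >= 0 AND station_id < 10000)
) INHERITS (measurement);

CREATE TABLE measurement_s1 (
    CHECK (station_id >= 10000 AND station_id < 20000)
) INHERITS (measurement);

-- ...more children as needed to cover the full station_id range.

-- With constraint_exclusion enabled, the planner skips any child whose
-- CHECK constraint can't possibly satisfy the WHERE clause:
SET constraint_exclusion = on;

SELECT avg(value)
  FROM measurement
 WHERE station_id BETWEEN 12000 AND 12500
   AND measurement_date BETWEEN '1900-01-01' AND '2009-12-31';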
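
As for trying indexes first, a multicolumn index matching the typical
restriction would be the place to start (again, just guessing at the
columns), and then check with EXPLAIN ANALYZE that the planner actually
uses it:

CREATE INDEX measurement_station_cat_date_idx
    ON measurement (station_id, category, measurement_date);

EXPLAIN ANALYZE
SELECT avg(value)
  FROM measurement
 WHERE station_id = 12345
   AND category = 'temperature'
   AND measurement_date BETWEEN '1900-01-01' AND '2009-12-31';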