Re: Optimize date query for large child tables: GiST or GIN?

Hi,

> certainly understand that you wouldn't want to partition by year.  It

Definitely not.
 
> does strike me that perhaps you could partition by day ranges, but you'd

I don't think that will work; users can choose any day range. The most common is Jan 1 - Dec 31, followed by seasonal ranges, followed by arbitrary ranges.
 
> some of this data.  If users are going to typically use 1900-2009 for
> years, then could the information about all of those years be aggregated
> a priori to make those queries faster?

I'm not sure what you mean. I could create a separate table that lumps the aggregated averages per year per station per category, but that will only help in the one case. There are five different reporting pages (Basic through Guru). On three of those pages the user must select arbitrary day ranges. On one of those pages, the user can select a season, which then maps to, for all intents and purposes, an arbitrary day range.

Only the most basic page does not offer the user a day range selection.
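
For example, a pre-aggregated table along those lines might look like the sketch below (the name climate.measurement_yearly and its columns are made up for illustration), but again it would only cover the whole-year case:

-- Hypothetical yearly roll-up; only useful when the range is Jan 1 - Dec 31.
CREATE TABLE climate.measurement_yearly AS
SELECT
  station_id,
  category_id,
  extract(YEAR FROM taken) AS year_taken,
  avg(amount) AS avg_amount
FROM
  climate.measurement
GROUP BY
  station_id,
  category_id,
  extract(YEAR FROM taken);

CREATE INDEX measurement_yearly_idx
  ON climate.measurement_yearly (category_id, station_id, year_taken);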
 
> Do not get hung up on having to have a separate table for every unique
> value in the column- you don't need that.  constraint_exclusion will

That's good advice. I have repartitioned the data into seven tables: one per category.
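
To be concrete, each child table is along these lines (just a sketch with illustrative names, not the exact DDL):

-- Hypothetical child table per category; the CHECK constraint is what
-- lets constraint_exclusion prune the other six partitions at plan time.
CREATE TABLE climate.measurement_007 (
  CHECK (category_id = 7)
) INHERITS (climate.measurement);

-- constraint_exclusion must be enabled for the pruning to kick in.
SET constraint_exclusion = on;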
 
> > I agree with Matthew Wakeling in a different post: it's probably wise to
>
> I would agree with this too- get it working first, then look at
> partitioning.  Even more so- work on a smaller data set to begin with

The query speed has now improved considerably, thanks to everybody's advice.

The estimated plan cost dropped from 10006220141 down to 704924. Here is the query:

SELECT
  avg(m.amount),
  extract(YEAR FROM m.taken) AS year_taken
FROM
  climate.city c,
  climate.station s,
  climate.measurement m
WHERE
  c.id = 5182 AND
  6371.009 * SQRT(
    POW(RADIANS(c.latitude_decimal - s.latitude_decimal), 2) +
    (COS(RADIANS(c.latitude_decimal + s.latitude_decimal) / 2) *
     POW(RADIANS(c.longitude_decimal - s.longitude_decimal), 2))
  ) <= 25 AND
  s.elevation BETWEEN 0 AND 3000 AND
  m.category_id = 7 AND
  m.station_id = s.id AND
  extract(YEAR FROM m.taken) BETWEEN 1900 AND 2000
GROUP BY
  extract(YEAR FROM m.taken)
ORDER BY
  extract(YEAR FROM m.taken)

(Note that extract(YEAR FROM m.taken) is much faster than date_part('year'::text, m.taken).)
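
The plan below references measurement_013_taken_year_idx; for what it's worth, an expression index of that kind can be declared roughly like this (a sketch, not necessarily the exact definition):

-- Expression index on the extracted year, so that
-- extract(YEAR FROM taken) BETWEEN 1900 AND 2000 can use an index scan.
CREATE INDEX measurement_013_taken_year_idx
  ON climate.measurement_013 ((extract(YEAR FROM taken)));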

The query plan for the above SQL reveals:

"Sort  (cost=704924.25..704924.75 rows=200 width=9) (actual time=9476.518..9476.521 rows=46 loops=1)"
"  Sort Key: (date_part('year'::text, (m.taken)::timestamp without time zone))"
"  Sort Method:  quicksort  Memory: 28kB"
"  ->  HashAggregate  (cost=704913.10..704916.60 rows=200 width=9) (actual time=9476.465..9476.489 rows=46 loops=1)"
"        ->  Hash Join  (cost=1043.52..679956.79 rows=4991262 width=9) (actual time=46.399..9344.537 rows=120678 loops=1)"
"              Hash Cond: (m.station_id = s.id)"
"              ->  Append  (cost=0.00..529175.42 rows=14973786 width=13) (actual time=0.076..7739.647 rows=14874909 loops=1)"
"                    ->  Seq Scan on measurement m  (cost=0.00..43.00 rows=1 width=20) (actual time=0.000..0.000 rows=0 loops=1)"
"                          Filter: ((category_id = 7) AND (date_part('year'::text, (taken)::timestamp without time zone) >= 1900::double precision) AND (date_part('year'::text, (taken)::timestamp without time zone) <= 2000::double precision))"
"                    ->  Index Scan using measurement_013_taken_year_idx on measurement_013 m  (cost=0.01..529132.42 rows=14973785 width=13) (actual time=0.075..6266.385 rows=14874909 loops=1)"
"                          Index Cond: ((date_part('year'::text, (taken)::timestamp without time zone) >= 1900::double precision) AND (date_part('year'::text, (taken)::timestamp without time zone) <= 2000::double precision))"
"                          Filter: (category_id = 7)"
"              ->  Hash  (cost=992.94..992.94 rows=4046 width=4) (actual time=43.420..43.420 rows=78 loops=1)"
"                    ->  Nested Loop  (cost=0.00..992.94 rows=4046 width=4) (actual time=0.053..43.390 rows=78 loops=1)"
"                          Join Filter: ((6371.009::double precision * sqrt((pow(radians(((c.latitude_decimal - s.latitude_decimal))::double precision), 2::double precision) + (cos((radians(((c.latitude_decimal + s.latitude_decimal))::double precision) / 2::double precision)) * pow(radians(((c.longitude_decimal - s.longitude_decimal))::double precision), 2::double precision))))) <= 25::double precision)"
"                          ->  Index Scan using city_pkey1 on city c  (cost=0.00..4.27 rows=1 width=16) (actual time=0.021..0.022 rows=1 loops=1)"
"                                Index Cond: (id = 5182)"
"                          ->  Seq Scan on station s  (cost=0.00..321.08 rows=12138 width=20) (actual time=0.008..5.457 rows=12139 loops=1)"
"                                Filter: ((s.elevation >= 0) AND (s.elevation <= 3000))"
"Total runtime: 9476.626 ms"

That's about 10 seconds using the category with the smallest table. The largest table takes 17 seconds (fantastic!) after a few warm runs and 85 seconds cold. My goal is about 1 second, before the pending hardware upgrades.

When I recreated the tables, I sorted them by date and then station id, so that there is now a 1:1 correlation between the sequence number and the measurement date.
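
Roughly, the reload looked like this (paraphrased; the staging table name is made up):

-- Reload a child table in sorted order so the physical (and sequence) order
-- follows the measurement date, then the station.
INSERT INTO climate.measurement_013
SELECT *
FROM climate.measurement_old_013
ORDER BY taken, station_id;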

Would clustering on the date and station make a difference?
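
To be concrete, by clustering I mean something like this (a sketch; the composite index name is made up):

-- Composite index on date and station, then physically reorder the child
-- table along it. CLUSTER takes an exclusive lock and is a one-time reorder,
-- so it would have to be repeated after bulk loads.
CREATE INDEX measurement_013_taken_station_idx
  ON climate.measurement_013 (taken, station_id);

CLUSTER climate.measurement_013 USING measurement_013_taken_station_idx;

ANALYZE climate.measurement_013;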

Is there any other way to index the data that I have missed?

Thank you very much.

Dave
