sc.taken_end <= '1996-12-31'::date AND
m.taken BETWEEN sc.taken_start AND sc.taken_end ANDcategory of data at a certain time. But I'm afraid this makes the planning
much more difficult, as the select from measurements depend on the data
returned by other parts of the query (rows from category).
Right. Users can select 1900 - 2009. Station data hardly ever spans that range.
The station_category is used to create a unique key into the measurement data for every station: station_id, category_id, and taken_start. The measurement data should be contiguous until taken_end.
I thought that that combination would be a pointer to the exact spot in the measurement table where the data starts, which should be ridiculously fast to find.
See this http://explain.depesz.com/s/H1 and this
http://explain.depesz.com/s/GGx
I was getting some red lines when I looked at a different plan. It's a great site.
How many rows does the query return without the group by clause? About
140000 in both cases, right?
SELECT
*
FROM
climate.measurement m
WHERE
m.station_id = 5148 AND
m.taken BETWEEN '1900-08-01'::date AND '2009-12-31'::date AND
m.category_id = 1
*
FROM
climate.measurement m
WHERE
m.station_id = 5148 AND
m.taken BETWEEN '1900-08-01'::date AND '2009-12-31'::date AND
m.category_id = 1
5397 rows (10 seconds cold; 0.5 seconds hot); estimated too high by 2275 rows?
OK, I haven't noticed the table is already partitioned by category_id and
I didn't mean to partition by (taken, category_id) - that would produce a
lot of partitions. Yes, that might cause problems related to number of
files, but that's rather a filesystem related issue.
Constrained as:
CONSTRAINT measurement_013_category_id_ck CHECK (category_id = 7)
I'd expect rather issues related to RULEs or triggers (not sure which of
them you use to redirect the data into partitions). But when partitioning
I created seven child tables of measurement. Each of these has a constraint by category_id. This makes it extremely fast to select the correct partition.
I'm not sure what you mean by 'year wrapping issue' but I think it might
work quite well - right not the problem is PostgreSQL decides to scan the
whole partition (all data for a given category_id).
I'll give it another try. :-)
Use Case #1
User selects: Mar 22 to Dec 22
User selects: 1900 to 2009
Result: Query should average 9 months of climate data per year between Mar 22 and Dec 22 of Year.
Use Case #2
User selects: 1900 to 2009
Result: Query should average 3 months of climate data per year between Dec 22 of Year and Mar 22 of Year+1.
So if a user selects 1950 to 1960:
- first case should average between 1950 and 1960; and
- second case should average between 1950 and 1961.