This seems like it will be much faster, though I will most likely create epoch ranges and forgo creating timestamps for all of my epoch values in my table of 100+ million rows. Thanks again Adam! Jeff From: Adam Cornett [mailto:adam.cornett@xxxxxxxxx] Sent: Thursday, October 20, 2011 11:37 AM To: Jeff Adams Cc: pgsql-general@xxxxxxxxxxxxxx Subject: Re: Date Range Using Months and Days Only On Thu, Oct 20, 2011 at 8:42 AM, Jeff Adams <Jeff.Adams@xxxxxxxx> wrote: Thanks for the reply and assistance. I share your concern that the approach may be slow. I am not adverse to creating some sort of table to store the time periods with years for the temporal extent of the data if that would speed up the process. In that situation there would be multiple records (one record for each year) that it would need to check against. I suppose I could then create some sort of aggregate to then count the number of records in which the date falls within the range. Anything over 0 would indicate that the date does fall within the range?
From: Adam Cornett [mailto:adam.cornett@xxxxxxxxx] Sent: Wednesday, October 19, 2011 6:51 PM To: Jeff Adams Cc: pgsql-general@xxxxxxxxxxxxxx Subject: Re: Date Range Using Months and Days Only
On Wed, Oct 19, 2011 at 12:40 PM, Jeff Adams <Jeff.Adams@xxxxxxxx> wrote: Greetings,
I have to write a query on a fairly large table of data (>100 million rows) where I need to check to see if a date (epoch) falls between a range of values. The catch is that the range is defined only by month and day values. For example the record containing the epoch value will be linked to a table that containing columns named start_month, start_day, end_month, end_day that define the range. With respect to the range, year does not matter, however, some of the ranges will start in November and end in April of the next year. Has anyone come across this type of query? I could certainly write a function or even include criteria in a query that would extract date parts of the epoch and then compare against the values in the start_month, start_day, end_month, end_day (it might get complex with respect to ranges where the start year and end year are different), but I am worried about performance. I thought I would seek some input before I floundered through the many iterations of poor performing alternatives! Any thoughts would be greatly appreciated.
Thanks in advance... Jeff
-- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
create table a ( id_a integer, epoch integer );
create table b ( id_b integer, start_month integer, start_day integer, end_month integer, end_day integer ); CREATE OR REPLACE FUNCTION get_timestamps(sm integer, sd integer, em integer, ed integer, year integer) RETURNS timestamp without time zone[] AS $BODY$ declare syear integer := year; eyear integer := year; tstamps timestamp[]; begin if(sm>em) then -- assume that since the end month is less than the start month is in the next year eyear := eyear+1; end if; tstamps[0] := (syear|| '-' || sm || '-' || sd)::timestamp; tstamps[1] := (eyear|| '-' || em || '-' || ed)::timestamp; return tstamps;
end $BODY$ LANGUAGE plpgsql STABLE;
create view a_timestamp as SELECT id_a, TIMESTAMP WITH TIME ZONE 'epoch' + epoch * INTERVAL '1 second' as tstamp from a;
with ab as( select id_a,id_b,get_timestamps(b.start_month,b.start_day,b.end_month,b.end_day, extract(year from a.tstamp)::integer) ts_arr,a.tstamp from a_timestamp as a,b ) select * from ab where ab.tstamp between ts_arr[0] and ts_arr[1]
This obviously isn't a fast solution to your problem, although converting the integer epoch to a timestamp in table a would eliminate the view a_timestamp and you can index the column for some speed up, the real problem you're facing is that your ranges don't have years, otherwise you could store everything as a timestamp and then just join using 'between' and postgres would just need to do an index scan on each table.
-Adam
Here is an updated method, it involves creating a third table to store the timestamp values from table 'b' to allow for better lookups, I've also added a timestamp column to table 'a' (which can be calculated from the epoch using the view in my previous email.
CREATE OR REPLACE FUNCTION get_timestamps(sm integer, sd integer, em integer, ed integer, year integer) RETURNS timestamp without time zone[] AS -- assume that since the end month is less than the start month is in the next year tstamps[0] := (syear|| '-' || sm || '-' || sd)::timestamp; tstamps[1] := (eyear|| '-' || em || '-' || ed)::timestamp; CREATE INDEX ON b_calc (range_start, range_end); -- generate the ranges for the last 10 years select id_b,get_timestamps(b.start_month,b.start_day,b.end_month,b.end_day, yr) ts_arr, yr from b, generate_series(1991,2011) as yr select id_b,yr,ts_arr[0],ts_arr[1] from bb; -- the actual query; explain shows that it uses the two indexes to match up the ranges select id_a,id_b from a, b_calc b where a.ts between b.range_start and b.range_end |