Re: Date Range Using Months and Days Only

Jeff Adams <Jeff.Adams@xxxxxxxx> · Thu, 20 Oct 2011 11:47:31 -0400

This seems like it will be much faster, though I will most likely create epoch ranges and forgo creating timestamps for all of my epoch values in my table of 100+ million rows. Thanks again Adam!

Jeff

From: Adam Cornett [mailto:adam.cornett@xxxxxxxxx] 
Sent: Thursday, October 20, 2011 11:37 AM
To: Jeff Adams
Cc: pgsql-general@xxxxxxxxxxxxxx
Subject: Re:  Date Range Using Months and Days Only

On Thu, Oct 20, 2011 at 8:42 AM, Jeff Adams <Jeff.Adams@xxxxxxxx> wrote:
Thanks for the reply and assistance. I share your concern that the approach
may be slow. I am not adverse to creating some sort of table to store the
time periods with years for the temporal extent of the data if that would
speed up the process. In that situation there would be multiple records (one
record for each year) that it would need to check against. I suppose I could
then create some sort of aggregate to then count the number of records in
which the date falls within the range. Anything over 0 would indicate that
the date does fall within the range?

From: Adam Cornett [mailto:adam.cornett@xxxxxxxxx]
Sent: Wednesday, October 19, 2011 6:51 PM
To: Jeff Adams
Cc: pgsql-general@xxxxxxxxxxxxxx
Subject: Re:  Date Range Using Months and Days Only

On Wed, Oct 19, 2011 at 12:40 PM, Jeff Adams <Jeff.Adams@xxxxxxxx> wrote:
Greetings,

I have to write a query on a fairly large table of data (>100 million rows)
where I need to check to see if a date (epoch) falls between a range of
values. The catch is that the range is defined only by month and day values.
For example the record containing the epoch value will be linked to a table
that containing columns named start_month, start_day, end_month, end_day
that define the range. With respect to the range, year does not matter,
however, some of the ranges will start in November and end in April of the
next year. Has anyone come across this type of query? I could certainly
write a function or even include criteria in a query that would extract date
parts of the epoch and then compare against the values in the start_month,
start_day, end_month, end_day (it might get complex with respect to ranges
where the start year and end year are different), but I am worried about
performance. I thought I would seek some input before I floundered through
the many iterations of poor performing alternatives! Any thoughts would be
greatly appreciated.

Thanks in advance...
Jeff

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

create table a (
    id_a integer,
    epoch integer
);

create table b (
    id_b integer,
    start_month integer,
    start_day integer,
    end_month integer,
    end_day integer
);
CREATE OR REPLACE FUNCTION get_timestamps(sm integer, sd integer, em
integer, ed integer, year integer)
  RETURNS timestamp without time zone[] AS
$BODY$
declare
    syear integer := year;
    eyear integer := year;
    tstamps timestamp[];
begin
    if(sm>em) then
     -- assume that since the end month is less than the start month is in
the next year
     eyear := eyear+1;
    end if;
    tstamps[0] := (syear|| '-' || sm || '-' || sd)::timestamp;
    tstamps[1] := (eyear|| '-' || em || '-' || ed)::timestamp;
    return tstamps;

end
$BODY$
  LANGUAGE plpgsql STABLE;

create view a_timestamp as
SELECT id_a, TIMESTAMP WITH TIME ZONE 'epoch' + epoch * INTERVAL '1 second'
as tstamp from a;

with ab as(
select
id_a,id_b,get_timestamps(b.start_month,b.start_day,b.end_month,b.end_day,
extract(year from a.tstamp)::integer) ts_arr,a.tstamp
 from a_timestamp as a,b
)
select * from ab
where ab.tstamp between ts_arr[0] and ts_arr[1]

This obviously isn't a fast solution to your problem, although converting
the integer epoch to a timestamp in table a would eliminate the view
a_timestamp and you can index the column for some speed up, the real problem
you're facing is that your ranges don't have years, otherwise you could
store everything as a timestamp and then just join using 'between' and
postgres would just need to do an index scan on each table.

-Adam

Here is an updated method, it involves creating a third table to store the timestamp values from table 'b' to allow for better lookups, I've also added a timestamp column to table 'a' (which can be calculated from the epoch using the view in my previous email. 

create table a (
    id_a integer,
    epoch integer,
    ts timestamp
);

CREATE INDEX  ON a (ts);

create table b (
    id_b integer,
    start_month integer,
    start_day integer,
    end_month integer,
    end_day integer
);
CREATE OR REPLACE FUNCTION get_timestamps(sm integer, sd integer, em integer, ed integer, year integer)
  RETURNS timestamp without time zone[] AS
$BODY$
declare
    syear integer := year;
    eyear integer := year;
    tstamps timestamp[];
begin
    if(sm>em) then
     -- assume that since the end month is less than the start month is in the next year
     eyear := eyear+1;
    end if;
    tstamps[0] := (syear|| '-' || sm || '-' || sd)::timestamp;
    tstamps[1] := (eyear|| '-' || em || '-' || ed)::timestamp;
    return tstamps;

end
$BODY$
  LANGUAGE plpgsql STABLE;

create table b_calc (
id_b integer,
year integer,
range_start timestamp,
range_end timestamp
);

CREATE INDEX ON b_calc (range_start, range_end);

-- generate the ranges for the last 10 years
insert into b_calc 
with bb as(
select id_b,get_timestamps(b.start_month,b.start_day,b.end_month,b.end_day, yr) ts_arr, yr
 from b, generate_series(1991,2011) as yr
)
select id_b,yr,ts_arr[0],ts_arr[1] from bb;

-- the actual query; explain shows that it uses the two indexes to match up the ranges

select id_a,id_b from a, b_calc b where a.ts between b.range_start and b.range_end