Hello, and thank you in advance.
Beyond the date vs. timestamp troubleshooting I did, I am not sure what else to look for. I know the increase in rows will have some effect, but I just don't think the query should go from 4 minutes to over 50.
System:
Laptop - Ubuntu 12.04 LTS
"PostgreSQL 9.2.1 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit"
Summary:
I have two tables. For each row of critical_visitors, I run a function called massive_expansion that takes that row's customer_id and the_range, uses them to query the table massive for all of that customer_id's rows whose dateof falls within the_range, and inserts those rows into another table. (A made-up sample row of critical_visitors is shown below, after the table definitions.)
Problem:
The table massive is now about 65 million rows (about double what it was). The query now takes over 50 minutes; it used to take 4 minutes with half the data.
History:
When I first started out I had everything as date, and the query went quite quickly: 4 minutes with the table massive at about 30 million rows.
I then re-did everything, but this time I used the actual timestamps instead of date (as a result there were a few more rows, but only about 2 million more). The query still went quickly: 5 minutes with the table massive at about 32 million rows.
So I did not see an appreciable difference between date and timestamp (all other things kept the same), and I suppose I can rule out date vs. timestamp as the issue.
The tsrange stored in the_range always covers an interval of 7 days.
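For example, a 7-day window like the ones stored in the_range can be built like this (the start timestamp and the '(]' bounds here are just taken from the range shown in the plans further down):
select tsrange('2012-07-22 17:00:00'::timestamp,
               '2012-07-22 17:00:00'::timestamp + interval '7 days',
               '(]');
-- result: ("2012-07-22 17:00:00","2012-07-29 17:00:00"]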
I have two tables:
First table:
CREATE TABLE massive
(
source character varying,
dateof timestamp without time zone,
customer_id bigint,
count_raw bigint
);
CREATE INDEX customer_id_dateof
ON massive
USING btree
(customer_id, dateof);
Second table:
CREATE TABLE critical_visitors
(
customer_id bigint,
dateof timestamp without time zone,
the_range tsrange
);
CREATE INDEX customer_id_range
ON critical_visitors
USING btree
(customer_id, the_range);
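For illustration, a single critical_visitors row might look like this (the values are made up, reusing the customer_id and range that appear in the plans further down):
insert into critical_visitors (customer_id, dateof, the_range)
values (9167174563,
        '2012-07-22 17:00:00',
        '(2012-07-22 17:00:00,2012-07-29 17:00:00]'::tsrange);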
The function:
CREATE or replace FUNCTION massive_expansion(customer_id bigint,the_range tsrange) RETURNS void AS $$
BEGIN
INSERT INTO massive_expansion_from_critical_visitors
(
select
massive.source,
massive.dateof,
massive.customer_id,
massive.count_raw
from
massive
where
massive.customer_id = $1
and
massive.dateof <@ the_range
);
END;
$$ LANGUAGE plpgsql;
The query:
select
massive_expansion(customer_id,the_range)
from
critical_visitors;
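For what it's worth, I can also time a single call in isolation from psql like this (again reusing the customer_id and range from the plans below):
\timing on
select massive_expansion(9167174563::bigint,
    '(2012-07-22 17:00:00,2012-07-29 17:00:00]'::tsrange);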
Additional Detail:
I ran the following query directly against the 30 million row table:
select
massive.source,
massive.dateof,
massive.customer_id,
massive.count_raw
from
massive
where
massive.customer_id = '<a customer_id goes here>'::bigint--$1
and
massive.dateof <@ '(2012-07-22 17:00:00,2012-07-29 17:00:00]'::tsrange;
With a query plan of:
"Index Scan using customer_id_sourcee on massive_m (cost=0.00..113.98 rows=1 width=28)"
" Index Cond: (customer_id = 9167174563::bigint)"
" Filter: (dateof <@ '("2012-07-22 17:00:00","2012-07-29 17:00:00"]'::tsrange)"
Then I ran the same query against the 65 million row table and got this plan:
"Index Scan using customer_id_source on massive (cost=0.00..189.38 rows=1 width=28)"
" Index Cond: (customer_id = 9167174563::bigint)"
" Filter: (dateof <@ '("2012-07-22 17:00:00","2012-07-29 17:00:00"]'::tsrange)"