Hello, and thank you in advance.
Beyond the date vs. timestamp troubleshooting I did, I am not sure what else to look for. I know the increase in rows will have some effect, but I just don't think the query should go from 4 minutes to over 50.
System:
Laptop - Ubuntu 12.04 LTS
"PostgreSQL 9.2.1 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit"
Summary:
I have two tables. For each row of critical_visitors, I run a function called massive_expansion that takes that row's customer_id and the_range, uses them to query the table massive for all of that customer_id's rows whose dateof falls within the_range, and inserts those rows into another table. (A made-up sample row of critical_visitors is shown below, after the table definitions.)
Problem:
The table massive is now about 65 million rows (about double what it was). The query now takes over 50 minutes; it used to take 4 minutes with half the data.
History:
When I first started out I had everything as date, and the query went quite quickly: 4 minutes with the table massive at about 30 million rows.
I then re-did everything, but this time I used the actual timestamps instead of date (as a result there were a few more rows, but only about 2 million more). The query still went quickly: 5 minutes with the table massive at about 32 million rows.
So I did not see an appreciable difference between date and timestamp (all other things kept the same), and I suppose I can rule out date vs. timestamp as the issue.
The tsrange stored in the_range always covers an interval of 7 days.
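For example, a 7-day window like the ones stored in the_range can be built like this (the start timestamp and the '(]' bounds here are just taken from the range shown in the plans further down):
select tsrange('2012-07-22 17:00:00'::timestamp,
               '2012-07-22 17:00:00'::timestamp + interval '7 days',
               '(]');
-- result: ("2012-07-22 17:00:00","2012-07-29 17:00:00"]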
I have two tables:
First table:
CREATE TABLE massive
(
source character varying,
dateof timestamp without time zone,
customer_id bigint,
count_raw bigint
);
CREATE INDEX customer_id_dateof
ON massive
USING btree
(customer_id, dateof);
Second table:
CREATE TABLE critical_visitors
(
customer_id bigint,
dateof timestamp without time zone,
the_range tsrange
);
CREATE INDEX customer_id_range
ON critical_visitors
USING btree
(customer_id, the_range);
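For illustration, a single critical_visitors row might look like this (the values are made up, reusing the customer_id and range that appear in the plans further down):
insert into critical_visitors (customer_id, dateof, the_range)
values (9167174563,
        '2012-07-22 17:00:00',
        '(2012-07-22 17:00:00,2012-07-29 17:00:00]'::tsrange);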
The function:
CREATE or replace FUNCTION massive_expansion(customer_id bigint,the_range tsrange) RETURNS void AS $$
BEGIN
INSERT INTO massive_expansion_from_critical_visitors
(
select
massive.source,
massive.dateof,
massive.customer_id,
massive.count_raw
from
massive
where
massive.customer_id = $1
and
massive.dateof <@ the_range
);
END;
$$ LANGUAGE plpgsql;
The query:
select
massive_expansion(customer_id,the_range)
from
critical_visitors;
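For what it's worth, I can also time a single call in isolation from psql like this (again reusing the customer_id and range from the plans below):
\timing on
select massive_expansion(9167174563::bigint,
    '(2012-07-22 17:00:00,2012-07-29 17:00:00]'::tsrange);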
Additional Detail:
I ran the following query directly against the 30 million row table:
select
massive.source,
massive.dateof,
massive.customer_id,
massive.count_raw
from
massive
where
massive.customer_id = '<a customer_id goes here>'::bigint--$1
and
massive.dateof <@ '(2012-07-22 17:00:00,2012-07-29 17:00:00]'::tsrange;
With a query plan of:
"Index Scan using customer_id_sourcee on massive_m (cost=0.00..113.98 rows=1 width=28)"
" Index Cond: (customer_id = 9167174563::bigint)"
" Filter: (dateof <@ '("2012-07-22 17:00:00","2012-07-29 17:00:00"]'::tsrange)"
Then I ran the same query against the 65 million row table and got this plan:
"Index Scan using customer_id_source on massive (cost=0.00..189.38 rows=1 width=28)"
" Index Cond: (customer_id = 9167174563::bigint)"
" Filter: (dateof <@ '("2012-07-22 17:00:00","2012-07-29 17:00:00"]'::tsrange)"