Re: Optimize date query for large child tables: GiST or GIN?

Yeb Havinga <yebhavinga@xxxxxxxxx> · Fri, 21 May 2010 10:38:43 +0200

David Jarvis wrote: 

    Also, you're trying to do constraint_exclusion, but have you made sure
    that it's turned on?  And have you made sure that those
    constraints are
    really the right ones and that they make sense?  You're using a
    bunch of
    extract()'s there too, why not just specify a CHECK constraint on the
    date ranges which are allowed in the table..?

I don't know what the date ranges are? So I can't partition them by year?

Right now I created 72 child tables by using the category and month. 
This may have been a bad choice. But at least all the data is in the 
system now so dissecting or integrating it back in different ways 
shouldn't take days.

Thanks everyone for all your help, I really appreciate the time you've 
taken to guide me in the right direction to make the system as fast as 
it can be.

My $0.02 - its hard to comment inline due to the number of responses, 
but: the partitioning is only useful for speed, if it matches how your 
queries select data. For time based data I would for sure go for year 
based indexing. If you want a fixed number of partitions, you could 
perhaps do something like year % 64. I did a test to see of the 
constraint exclusion could work with extract but that failed:

test=# create table parent(t timestamptz);
test=# create table child1(check ((extract(year from t)::int % 2)=0)) 
inherits( parent);
test=# create table child2(check ((extract(year from t)::int % 2)=1)) 
inherits(parent);
test=# explain select * from parent where (extract(year from t)::int % 
2) = 0;
                               QUERY PLAN                                
---------------------------------------------------------------------------
Result  (cost=0.00..158.40 rows=33 width=8)
  ->  Append  (cost=0.00..158.40 rows=33 width=8)
        ->  Seq Scan on parent  (cost=0.00..52.80 rows=11 width=8)
              Filter: (((date_part('year'::text, t))::integer % 2) = 0)
        ->  Seq Scan on child1 parent  (cost=0.00..52.80 rows=11 width=8)
              Filter: (((date_part('year'::text, t))::integer % 2) = 0)
        ->  Seq Scan on child2 parent  (cost=0.00..52.80 rows=11 width=8)
              Filter: (((date_part('year'::text, t))::integer % 2) = 0)

It hits all partitions even when I requested for a single year.

So an extra column would be needed, attempt 2 with added year smallint.

test=# create table parent(t timestamptz, y smallint);
test=# create table child1(check ((y % 2)=0)) inherits( parent);
test=# create table child2(check ((y % 2)=1)) inherits( parent);
test=# explain select * from parent where (y % 2) between 0 and 0;
                                  QUERY 
PLAN                                   
---------------------------------------------------------------------------------
Result  (cost=0.00..122.00 rows=20 width=10)
  ->  Append  (cost=0.00..122.00 rows=20 width=10)
        ->  Seq Scan on parent  (cost=0.00..61.00 rows=10 width=10)
              Filter: ((((y)::integer % 2) >= 0) AND (((y)::integer % 
2) <= 0))
        ->  Seq Scan on child1 parent  (cost=0.00..61.00 rows=10 width=10)
              Filter: ((((y)::integer % 2) >= 0) AND (((y)::integer % 
2) <= 0))

This works: only one child table hit.

That made me think: if you'd scan two consecutive years, you'd always 
hit two different partitions. For your use case it'd be nice if some 
year wraparounds would fall in the same partition. The following query 
shows partition numbers for 1900 - 2010 with 4 consecutive years in the 
same partition. It also shows that in this case 32 partitions is enough:

test=# select x, (x / 4) % 32 from generate_series(1900,2010) as x(x);
x   | ?column?
------+----------
1900 |       27
1901 |       27
1902 |       27
1903 |       27
1904 |       28
1905 |       28
etc
1918 |       31
1919 |       31
1920 |        0
1921 |        0
etc
2005 |       21
2006 |       21
2007 |       21
2008 |       22
2009 |       22
2010 |       22
(111 rows)

This would mean that a extra smallint column is needed which would 
inflate the 300M relation with.. almost a GB, but I still think it'd be 
a good idea.

create or replace function yearmod(int) RETURNS int
as 'select (($1 >> 2) %32);'
language sql
immutable
strict;

create table parent(t timestamptz, y smallint);

select 'create table child'||x||'(check (yearmod(y)='||x-1||')) 
inherits(parent);' from generate_series(1,32) as x(x);
                          ?column?                           
---------------------------------------------------------------
create table child1(check (yearmod(y)=0)) inherits(parent);
create table child2(check (yearmod(y)=1)) inherits(parent);
create table child3(check (yearmod(y)=2)) inherits(parent);
etc
create table child30(check (yearmod(y)=29)) inherits(parent);
create table child31(check (yearmod(y)=30)) inherits(parent);
create table child32(check (yearmod(y)=31)) inherits(parent);
(32 rows)

Copy and paste output of this query in psql to create child tables.

Example with period 1970 to 1980:

test=# explain select * from parent where yearmod(y) between 
yearmod(1970) and yearmod(1980);
                                QUERY 
PLAN                                 
-----------------------------------------------------------------------------
Result  (cost=0.00..305.00 rows=50 width=10)
  ->  Append  (cost=0.00..305.00 rows=50 width=10)
        ->  Seq Scan on parent  (cost=0.00..61.00 rows=10 width=10)
              Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
        ->  Seq Scan on child13 parent  (cost=0.00..61.00 rows=10 width=10)
              Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
        ->  Seq Scan on child14 parent  (cost=0.00..61.00 rows=10 width=10)
              Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
        ->  Seq Scan on child15 parent  (cost=0.00..61.00 rows=10 width=10)
              Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
        ->  Seq Scan on child16 parent  (cost=0.00..61.00 rows=10 width=10)
              Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
(12 rows)

This works: query for 11 consecutive years hits only 4 from 31.

But the between fails for yearmods that wrap the 31 boundary, what 
happens here between 1910 and 1920

test=# explain select * from parent where yearmod(y) between 
yearmod(1910) and yearmod(1920);
               QUERY PLAN               
------------------------------------------
Result  (cost=0.00..0.01 rows=1 width=0)
  One-Time Filter: false
(2 rows)

So for the wraparound case we need a CASE:

test=# explain select * from parent where case when yearmod(1910) <= 
yearmod(1920)
then yearmod(y) between yearmod(1910) and yearmod(1920)
else (yearmod(y) >= yearmod(1910) or yearmod(y) <= yearmod(1920)) end;
                                 QUERY 
PLAN                                  
-------------------------------------------------------------------------------
Result  (cost=0.00..305.00 rows=5665 width=10)
  ->  Append  (cost=0.00..305.00 rows=5665 width=10)
        ->  Seq Scan on parent  (cost=0.00..61.00 rows=1133 width=10)
              Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
        ->  Seq Scan on child1 parent  (cost=0.00..61.00 rows=1133 
width=10)
              Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
        ->  Seq Scan on child30 parent  (cost=0.00..61.00 rows=1133 
width=10)
              Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
        ->  Seq Scan on child31 parent  (cost=0.00..61.00 rows=1133 
width=10)
              Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
        ->  Seq Scan on child32 parent  (cost=0.00..61.00 rows=1133 
width=10)
              Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
(12 rows)

This should work for all year ranges and I think is a good solution for 
partitioning on year with a fixed amount of partitions.

From the optimizer perspective I wonder what the best access path for 
this kind of query would be (if there would be no partitions). Building 
on ideas from one of Thom Brown's first replies with indexes on year and 
doy, and Tom Lane's remark about the leap year problem. Suppose the leap 
years did not exist, having a index on year, and having a different 
index on doy, sounds like a bitmap and of a scan of both the year and 
doy indexes could provide a optimal path. Maybe this would still be 
possible, if the leap year problem could be 'fixed' by a additional 
condition in the where clause that filters the surplus records.

regards,
Yeb Havinga

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance