Re: Optimize date query for large child tables: GiST or GIN?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



David Jarvis wrote:

    Also, you're trying to do constraint_exclusion, but have you made sure
    that it's turned on?  And have you made sure that those
    constraints are
    really the right ones and that they make sense?  You're using a
    bunch of
    extract()'s there too, why not just specify a CHECK constraint on the
    date ranges which are allowed in the table..?


I don't know what the date ranges are? So I can't partition them by year?

Right now I created 72 child tables by using the category and month. This may have been a bad choice. But at least all the data is in the system now so dissecting or integrating it back in different ways shouldn't take days.

Thanks everyone for all your help, I really appreciate the time you've taken to guide me in the right direction to make the system as fast as it can be.

My $0.02 - its hard to comment inline due to the number of responses, but: the partitioning is only useful for speed, if it matches how your queries select data. For time based data I would for sure go for year based indexing. If you want a fixed number of partitions, you could perhaps do something like year % 64. I did a test to see of the constraint exclusion could work with extract but that failed:

test=# create table parent(t timestamptz);
test=# create table child1(check ((extract(year from t)::int % 2)=0)) inherits( parent); test=# create table child2(check ((extract(year from t)::int % 2)=1)) inherits(parent); test=# explain select * from parent where (extract(year from t)::int % 2) = 0; QUERY PLAN ---------------------------------------------------------------------------
Result  (cost=0.00..158.40 rows=33 width=8)
  ->  Append  (cost=0.00..158.40 rows=33 width=8)
        ->  Seq Scan on parent  (cost=0.00..52.80 rows=11 width=8)
              Filter: (((date_part('year'::text, t))::integer % 2) = 0)
        ->  Seq Scan on child1 parent  (cost=0.00..52.80 rows=11 width=8)
              Filter: (((date_part('year'::text, t))::integer % 2) = 0)
        ->  Seq Scan on child2 parent  (cost=0.00..52.80 rows=11 width=8)
              Filter: (((date_part('year'::text, t))::integer % 2) = 0)

It hits all partitions even when I requested for a single year.

So an extra column would be needed, attempt 2 with added year smallint.

test=# create table parent(t timestamptz, y smallint);
test=# create table child1(check ((y % 2)=0)) inherits( parent);
test=# create table child2(check ((y % 2)=1)) inherits( parent);
test=# explain select * from parent where (y % 2) between 0 and 0;
QUERY PLAN ---------------------------------------------------------------------------------
Result  (cost=0.00..122.00 rows=20 width=10)
  ->  Append  (cost=0.00..122.00 rows=20 width=10)
        ->  Seq Scan on parent  (cost=0.00..61.00 rows=10 width=10)
Filter: ((((y)::integer % 2) >= 0) AND (((y)::integer % 2) <= 0))
        ->  Seq Scan on child1 parent  (cost=0.00..61.00 rows=10 width=10)
Filter: ((((y)::integer % 2) >= 0) AND (((y)::integer % 2) <= 0))

This works: only one child table hit.

That made me think: if you'd scan two consecutive years, you'd always hit two different partitions. For your use case it'd be nice if some year wraparounds would fall in the same partition. The following query shows partition numbers for 1900 - 2010 with 4 consecutive years in the same partition. It also shows that in this case 32 partitions is enough:

test=# select x, (x / 4) % 32 from generate_series(1900,2010) as x(x);
x   | ?column?
------+----------
1900 |       27
1901 |       27
1902 |       27
1903 |       27
1904 |       28
1905 |       28
etc
1918 |       31
1919 |       31
1920 |        0
1921 |        0
etc
2005 |       21
2006 |       21
2007 |       21
2008 |       22
2009 |       22
2010 |       22
(111 rows)

This would mean that a extra smallint column is needed which would inflate the 300M relation with.. almost a GB, but I still think it'd be a good idea.

create or replace function yearmod(int) RETURNS int
as 'select (($1 >> 2) %32);'
language sql
immutable
strict;

create table parent(t timestamptz, y smallint);

select 'create table child'||x||'(check (yearmod(y)='||x-1||')) inherits(parent);' from generate_series(1,32) as x(x); ?column? ---------------------------------------------------------------
create table child1(check (yearmod(y)=0)) inherits(parent);
create table child2(check (yearmod(y)=1)) inherits(parent);
create table child3(check (yearmod(y)=2)) inherits(parent);
etc
create table child30(check (yearmod(y)=29)) inherits(parent);
create table child31(check (yearmod(y)=30)) inherits(parent);
create table child32(check (yearmod(y)=31)) inherits(parent);
(32 rows)

Copy and paste output of this query in psql to create child tables.

Example with period 1970 to 1980:

test=# explain select * from parent where yearmod(y) between yearmod(1970) and yearmod(1980); QUERY PLAN -----------------------------------------------------------------------------
Result  (cost=0.00..305.00 rows=50 width=10)
  ->  Append  (cost=0.00..305.00 rows=50 width=10)
        ->  Seq Scan on parent  (cost=0.00..61.00 rows=10 width=10)
              Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
        ->  Seq Scan on child13 parent  (cost=0.00..61.00 rows=10 width=10)
              Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
        ->  Seq Scan on child14 parent  (cost=0.00..61.00 rows=10 width=10)
              Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
        ->  Seq Scan on child15 parent  (cost=0.00..61.00 rows=10 width=10)
              Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
        ->  Seq Scan on child16 parent  (cost=0.00..61.00 rows=10 width=10)
              Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
(12 rows)

This works: query for 11 consecutive years hits only 4 from 31.

But the between fails for yearmods that wrap the 31 boundary, what happens here between 1910 and 1920

test=# explain select * from parent where yearmod(y) between yearmod(1910) and yearmod(1920); QUERY PLAN ------------------------------------------
Result  (cost=0.00..0.01 rows=1 width=0)
  One-Time Filter: false
(2 rows)

So for the wraparound case we need a CASE:

test=# explain select * from parent where case when yearmod(1910) <= yearmod(1920)
then yearmod(y) between yearmod(1910) and yearmod(1920)
else (yearmod(y) >= yearmod(1910) or yearmod(y) <= yearmod(1920)) end;
QUERY PLAN -------------------------------------------------------------------------------
Result  (cost=0.00..305.00 rows=5665 width=10)
  ->  Append  (cost=0.00..305.00 rows=5665 width=10)
        ->  Seq Scan on parent  (cost=0.00..61.00 rows=1133 width=10)
              Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
-> Seq Scan on child1 parent (cost=0.00..61.00 rows=1133 width=10)
              Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
-> Seq Scan on child30 parent (cost=0.00..61.00 rows=1133 width=10)
              Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
-> Seq Scan on child31 parent (cost=0.00..61.00 rows=1133 width=10)
              Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
-> Seq Scan on child32 parent (cost=0.00..61.00 rows=1133 width=10)
              Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
(12 rows)

This should work for all year ranges and I think is a good solution for partitioning on year with a fixed amount of partitions.

From the optimizer perspective I wonder what the best access path for this kind of query would be (if there would be no partitions). Building on ideas from one of Thom Brown's first replies with indexes on year and doy, and Tom Lane's remark about the leap year problem. Suppose the leap years did not exist, having a index on year, and having a different index on doy, sounds like a bitmap and of a scan of both the year and doy indexes could provide a optimal path. Maybe this would still be possible, if the leap year problem could be 'fixed' by a additional condition in the where clause that filters the surplus records.

regards,
Yeb Havinga


--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux