Re: FW: Constraint exclusion in partitions

Francisco Olarte <folarte@xxxxxxxxxxxxxx> · Sun, 24 May 2015 13:25:15 +0200

Hi Daniel:

On Sat, May 23, 2015 at 8:37 PM, Daniel Begin <jfd553@xxxxxxxxxxx> wrote:
> Following Francisco suggestion, I was able to do some tests earlier this morning when the partitioning process completed and all the resulting tables analyzed.

> Here is what I got on both the original table and its partitioned counterpart while running the same queries. I tested them only for a couple of values but in summary...
> Using a constant id:
> All the queries I tried took longer on the partitioned table! I got similar results for multiple records using IN (id value1, id value2 ...)
> Using a range of ids:
> Surprisingly again, all the queries I tried took longer on the partitioned table!
> Using a list of ids from a select clause:
> More surprisingly, the queries I tried took less time on the partitioned table at least when using the primary key. Using an indexed field took so long compared to the old table that I cancelled the execution for the new one!

> Guess what, I will get back to my old fat table unless someone tells me I missed something obvious!
>
> Note: Tables/indexes description, queries and execution plans are below.
>
>
>
>
>
>
> Tables/indexes description ----------------------------------------------------------------------------------
> The original table has 3870130000 records. Primary key/index on each partition queries are
> ALTER TABLE oldtable ADD CONSTRAINT oldtable_idversion_pk PRIMARY KEY (id, version);
> CREATE INDEX oldtable_groupid_idx ON oldtable USING btree (group_id);
>
> The partitioned table has 3870130000 records distributed over 87 partitions. Primary key/index on each partition queries are
> ALTER TABLE newtable_xx ADD CONSTRAINT newtablexx_idversion_pk PRIMARY KEY (id, version);
> CREATE INDEX newtablexx_groupid_idx ON newtable_xx USING btree (group_id);
> Where xx is the partition's number suffix

It is missing here, but I supose you did not forget to add the
exclusion check constraint.

> constant id -------------------------------------------------------------------------------------------------------
> select * from oldtable where id=123456789;
> "Index Scan using oldtable_idversion_pk on oldtable  (cost=0.70..4437.15 rows=1682 width=66)"
> "  Index Cond: (id = 123456789::bigint)"
> --Total query runtime: 62 ms. 1 rows retrieved
> select * from newtable where id=123456789;
> "Append  (cost=0.00..20.19 rows=5 width=66)"
> "  ->  Seq Scan on newtable  (cost=0.00..0.00 rows=1 width=66)"
> "        Filter: (id = 123456789::bigint)"
> "  ->  Index Scan using newtable72_idversion_pk on newtable_72  (cost=0.56..20.19 rows=4 width=66)"
> "        Index Cond: (id = 123456789::bigint)"
> --Total query runtime: 156 ms. 1 rows retrieved

This is not surprissing, partitions never help much against that type
of queries, with 87 partitions index are not going to be much
shallower and the optimizer has a lot more of work to do. Just a
couple points. When doing a lot of sequential queries on nearby ids (
on the same partition ) it will help ( better data locality,
possibility of having more relevant pages in the cache ). Also, how
did you do your timinngs? ( you should repeat each query 2 or 3 times,
to see if the times go down a lot due to caching and, if you can, time
a couple of explains ( plain, not analyze ) to see how much time the
planner takes. As an example, the explain I sent you in my previous
message takes between 20 and 30 milliseconds and I only have 17 live
partitions ( I routinely un-inherit and move to another schema
partitions older than a two years, to help the planner, and re-inherit
them if needed ) ). Also, your queries seem to be very, very slow for
a single indexed fetch, if I do this on my tables ( similar query to
what I sent, modified to hit just 1 partition ):
select count(*) from carrier_cdrs where setup between
'20150107T123456' and '20150107T222222';
It takes 40 ms in the first go, drops down to 27 after that, and I
have ~15 ms RTT to the server ( which is more or less the time
reported when I do a plain 'select 1' ). I mean, I suspect your
measurements are not good, as they seem too slow. Also, when I use
explain analyze on the previous query the server reports 13 ms, which
is more or less one RTT less ( as the server starts measuring after
receiving the query and ends before sending the reply ).

Another thing, how are you getting the explain results ? I would urge
you to use explain analyze, as, apart of seeming too slow, the
stimations seem to be way off ( explain analyze sends you the
estimated and actual results, so it is easy to see, and executes
everything on the server, so result transmission time, which is not of
use as it has to be the same for every method of calculating the same
correct result, is not reported ). For my query I get this:

            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1229.36..1229.37 rows=1 width=0) (actual
time=13.794..13.794 rows=1 loops=1)
   ->  Append  (cost=0.00..1169.73 rows=23851 width=0) (actual
time=0.014..12.069 rows=24016 loops=1)
         ->  Seq Scan on carrier_cdrs  (cost=0.00..0.00 rows=1
width=0) (actual time=0.000..0.000 rows=0 loops=1)
               Filter: ((setup >= '2015-01-07 12:34:56+01'::timestamp
with time zone) AND (setup <= '2015-01-07 22:22:22+01'::timestamp with
time zone))
         ->  Index Only Scan using idx_carrier_cdrs_201501_setup on
carrier_cdrs_201501  (cost=0.42..1169.73 rows=23850 width=0) (actual
time=0.014..10.162 rows=24016 loops=1)
               Index Cond: ((setup >= '2015-01-07
12:34:56+01'::timestamp with time zone) AND (setup <= '2015-01-07
22:22:22+01'::timestamp with time zone))
               Heap Fetches: 24016
 Total runtime: 13.819 ms
(8 rows)

As you can see this reports the time for all the nodes, and the
estimated and actual numbers of rows in every step, so things are
easier to check ( like how the planner estimated 23850 rows which
where really 24016, a .69% error, indicating good estimation ).

> I got similar results for multiple records...
> select * from oldtable where id IN(10000000,1000000000,2000000000,3000000000);
> "Index Scan using oldtable_idversion_pk on oldtable  (cost=0.71..17739.18 rows=6726 width=66)"
> "  Index Cond: (id = ANY ('{10000000,1000000000,2000000000,3000000000}'::bigint[]))"
> --Total query runtime: 187 ms. 4 rows retrieved

This seems to slow too, also, estimated rows 6726 for an unique index
with 4 values seems totally bad.

> select * from newtable where id IN(10000000,1000000000,2000000000,3000000000);
> "Append  (cost=0.00..933.40 rows=223 width=66)"
> "  ->  Seq Scan on newtable  (cost=0.00..0.00 rows=1 width=66)"
> "        Filter: (id = ANY ('{10000000,1000000000,2000000000,3000000000}'::bigint[]))"
> "  ->  Index Scan using newtable01_idversion_pk on newtable_01  (cost=0.57..622.78 rows=156 width=66)"
> "        Index Cond: (id = ANY ('{10000000,1000000000,2000000000,3000000000}'::bigint[]))"

> ...
********* How many partitions where omitted here? ( if they were two
the problem of redudant checking of id is minor, if they where 85 it
means constratint exclusion is not working for in )

> "  ->  Index Scan using newtable85_idversion_pk on newtable_85  (cost=0.57..53.37 rows=9 width=66)"
> "        Index Cond: (id = ANY ('{10000000,1000000000,2000000000,3000000000}'::bigint[]))"
> --Total query runtime: 421 ms. 4 rows retrieved

And here it is doing strange things ( fetching every value in every partition

> range of ids -------------------------------------------------------------------------------------------------------
> select * from oldtable where id between 1522999949 and 1523000049;
> "Index Scan using oldtable_idversion_pk on oldtable  (cost=0.70..383.51 rows=144 width=66)"
> "  Index Cond: ((id >= 1522999949) AND (id <= 1523000049))"
> Total query runtime: 47 ms. 53 rows retrieved.

As I was telling you above. It is not normal to take 62 ms for a
single id and just 47 for a 53 range. And this is referring to the old
table case. I keep thinking you are trashing the caches, and your
measurements are not good, you are measuring cache effects ( as I said
before, the easiest ways to discount them is to repeat every query
three or more times in sucession, using explain analyze to discard
network effects, and see where they stabilize )

> Using an index, not the primary key ------------------------------------------------------------------------------

Those are going to be specially bad. If you partition and then do
indexed queries against a non partitioned field WITHOUT extra
conditions which narrow the result to a couple of them, it will always
take a lot of time. The server needs to scan every index ( as your
query is basically a union all of every partition ). Think on it this
way, if I have one big phone call table, partitioned by months,
indexed by timestamp and number ( 2 indexes ) , queries for a TS range
go fast, as they go to the relevant partitions. Queries for a number
with a some-days timestamp range also go fast, as it has to index scan
one or two partitions for the number, but queries for just the number
need to index scan every partition, and they are going to be slow, as
they need to read some pages of every number index.

> --Subset provides 58 group_id pointing to 5978 records in the concerned tables
> select * from oldtable where group_id IN (select * from subset)

If you need to do this queries, and group id is uncorrelated wit id,
you are not going to get good results from partitioning.

Regards.
   Francisco Olarte.

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general