Oddball data distribution giving me planner headaches

Jeff Amiel <becauseimjeff@xxxxxxxxx> · Fri, 2 Dec 2011 12:19:39 -0800 (PST)

Oddball data distribution giving me headaches.

We have a distinct 'customer' table with customer_id, type and name/demographic information.
Assume some 1 million rows in the customer table.

We then have a customer 'relationship' table which simply contains 2 columns…designating parent and child relationships…but allowing complex hierarchies between customers.

CREATE TABLE customer_rel
(
  parent_customer integer NOT NULL,
  child_customer integer NOT NULL,
 )

8 million rows in this table.  Oddball distribution.  We have some 8 levels of hierarchy (customer type) represented with this table.  Every customer gets an entry where parent/child is themselves…and then for every 'upline'.  At the highest level, we have 'distributors' which have all other customer types underneath them.  Assuming we had some 68 distributors, the entries where THEY are the parent_customer represent nearly a million rows of the 8 million.

I have extracted a simple case from a larger query that was generating an off-beat plan because of the unexpected planner row-counts being spewed by a low level query.

explain analyze 
select * from customer_rel where parent_customer in (select customer_id from customer where customer_type='DISTRIBUTOR')

"  Nested Loop  (cost=25429.44..29626.39 rows=931 width=0) (actual time=216.325..1238.091 rows=1025401 loops=1)"
"        ->  HashAggregate  (cost=25429.44..25430.80 rows=136 width=4) (actual time=216.304..216.339 rows=68 loops=1)"
"              ->  Seq Scan on customer  (cost=0.00..25429.10 rows=136 width=4) (actual time=0.018..216.226 rows=68 loops=1)"
"                    Filter: (customer_type = 'DISTRIBUTOR'::bpchar)"
"        ->  Index Scan using rel_parent on customer_rel  (cost=0.00..30.76 rows=7 width=4) (actual time=0.006..8.190 rows=15079 loops=68)"
"              Index Cond: (parent_customer = customer.customer_id)"
"Total runtime: 1514.810 ms"	

The fact that the top level nested loop THINKS it only will be returning 931 rows (instead of over 1 million) is the killer here…3 orders of magnitude.  The results of this  query are used as part of a bigger query and the screwed up stats are causing all sorts of havoc upline.

I'm experimenting in 9.1.0…have set the statistics to 1000 (and 10000) on both columns (parent and child) to little effect.  Have hardcoded the n_distinct on the parent_customer column to be 1,000,000 also…with no effect (doing analyze of table after each change)

Does this oddball data distribution doom me to poor planning forever?
Any other thoughts?

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general