On 18.08.2008, at 18:05, Matthew Wakeling wrote:
On Mon, 18 Aug 2008, Moritz Onken wrote:
Running the query for more than one user is indeed not much slower.
That's what I need. I'm clustering the result table on domain
right now. But why is this better than clustering it on "user"?
The reason is the way the merge join algorithm works. It takes
two tables and sorts them both by the join fields, then streams
through both tables, producing results as it goes. It's the best
join algorithm for large inputs, but it does require both tables
to be sorted by the same thing, which is domain in this case. The
aggregation on user happens after the join has been done, and the
hash aggregate can accept the users in random order.
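As a quick sketch, with two made-up tables t1 and t2 joined on a
column k, you can see this requirement in a plan by switching the
other join strategies off:

set enable_hashjoin = off;
set enable_nestloop = off;
explain select * from t1 join t2 on t1.k = t2.k;
-- each input to the Merge Join now appears either under a Sort node
-- or as an index scan that already returns its rows in k order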
If you look at your last EXPLAIN, you'll see that it has to sort
the result table on domain, although it can read domain_categories
in domain order thanks to the clustered index.
explain select
a."user", b.category, sum(1.0/b.cat_count)::float
from result a, domain_categories b
where a."domain" = b."domain"
group by a."user", b.category;
"GroupAggregate (cost=21400443313.69..22050401897.13 rows=35049240
width=12)"
" -> Sort (cost=21400443313.69..21562757713.35 rows=64925759864
width=12)"
" Sort Key: a."user", b.category"
" -> Merge Join (cost=4000210.40..863834009.08
rows=64925759864 width=12)"
" Merge Cond: (b.domain = a.domain)"
" -> Index Scan using domain_categories_domain on
domain_categories b (cost=0.00..391453.79 rows=12105014 width=12)"
" -> Materialize (cost=3999931.73..4253766.93
rows=20306816 width=8)"
" -> Sort (cost=3999931.73..4050698.77
rows=20306816 width=8)"
" Sort Key: a.domain"
" -> Seq Scan on result a
(cost=0.00..424609.16 rows=20306816 width=8)"
Both result and domain_categories are clustered on domain and analyzed.
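That was done with roughly the following; domain_categories_domain
is the index name from the plan above, but the index name on result
is only a guess:

cluster domain_categories using domain_categories_domain;
cluster result using result_domain_idx;  -- index name assumed
analyze domain_categories;
analyze result;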
Running this query for 280 users (with "and "user" IN ([280 ids])"
added) took 50 minutes; 78,000 rows were returned and stored in a
table. Is that reasonable?
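That is, roughly the following (the actual list of ids is elided,
and the target table name is made up):

create table result_280 as  -- hypothetical table name
select a."user", b.category, sum(1.0/b.cat_count)::float
from result a, domain_categories b
where a."domain" = b."domain"
and a."user" in ([280 ids])
group by a."user", b.category;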
Why is it still sorting on domain? I thought the clustering should
prevent the planner from doing this?
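One thing I could check is whether the planner still believes the
table is physically in domain order; after the ANALYZE,
pg_stats.correlation for the domain columns should be close to 1.0:

select tablename, attname, correlation
from pg_stats
where tablename in ('result', 'domain_categories')
and attname = 'domain';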
moritz