Hi,
I run this query:
select max(a."user"), b.category, count(1)
from result a, domain_categories b
where a."domain" = b."domain"
group by b.category;
The table result contains all websites a user has visited, and the table
domain_categories contains all categories a domain belongs to.
result has 20 million rows and domain_categories has about 12 million.
There are 500,000 distinct users.
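Simplified, the relevant parts of the schema look roughly like this (column
types are an assumption on my part, based on the width=8 rows in the plan
below; other columns are omitted):

-- sketch of the assumed schema, not the full table definitions
create table result (
    "user"   integer,  -- id of the visiting user
    "domain" integer   -- id of the visited domain
);

create table domain_categories (
    "domain" integer,  -- id of the domain
    category integer   -- id of a category the domain belongs to
);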
I have indexes on result.domain, domain_categories.domain,
result.user, and domain_categories.category. The table result is
clustered on user and domain_categories on domain.
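Roughly, the index and cluster setup was created like this (only
result_user_idx is named in the plan below; the other index names here are
just placeholders for whatever the real ones are called):

-- sketch of the assumed index/cluster DDL
create index result_user_idx   on result ("user");
create index result_domain_idx on result ("domain");
create index dc_domain_idx     on domain_categories ("domain");
create index dc_category_idx   on domain_categories (category);

-- PostgreSQL 8.3 syntax: cluster the table on the given index
cluster result            using result_user_idx;
cluster domain_categories using dc_domain_idx;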
EXPLAIN ANALYZE says (limited to one user with id 1337):
"HashAggregate (cost=2441577.16..2441614.72 rows=2504 width=8)
(actual time=94667.335..94671.508 rows=3361 loops=1)"
" -> Merge Join (cost=2119158.02..2334105.00 rows=14329622 width=8)
(actual time=63559.938..94621.557 rows=36308 loops=1)"
" Merge Cond: (a.domain = b.domain)"
" -> Sort (cost=395.52..405.49 rows=3985 width=8) (actual
time=0.189..0.211 rows=19 loops=1)"
" Sort Key: a.domain"
" Sort Method: quicksort Memory: 27kB"
" -> Index Scan using result_user_idx on result a
(cost=0.00..157.21 rows=3985 width=8) (actual time=0.027..0.108
rows=61 loops=1)"
" Index Cond: ("user" = 1337)"
" -> Materialize (cost=2118752.28..2270064.64 rows=12104989
width=8) (actual time=46460.599..82336.116 rows=12123161 loops=1)"
" -> Sort (cost=2118752.28..2149014.75 rows=12104989
width=8) (actual time=46460.592..59595.851 rows=12104989 loops=1)"
" Sort Key: b.domain"
" Sort Method: external sort Disk: 283992kB"
" -> Seq Scan on domain_categories b
(cost=0.00..198151.89 rows=12104989 width=8) (actual
time=14.352..22572.869 rows=12104989 loops=1)"
"Total runtime: 94817.058 ms"
This is running on a pretty small server with 1 GB of RAM and a slow
SATA hard disk. shared_buffers is 312MB and max_fsm_pages = 153600;
everything else in postgresql.conf is left commented out at its default.
PostgreSQL 8.3.3 on Ubuntu 8.04.
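For completeness, the non-default part of postgresql.conf is just:

# the only two settings I changed; everything else is at its default
shared_buffers = 312MB
max_fsm_pages = 153600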
It would be great if someone could help improve this query. This is
for a research project at my university.
Thanks in advance,
Moritz