[please CC, I'm not on the list]

Hi all,

we have one table that basically uses Postgres as a key-value store:

        Table "public.termindex"
       Column   |  Type   | Modifiers
    ------------+---------+-----------
     subject_id | integer |
     indextype  | integer |
     cid        | integer |

This is with Postgres 9.0. The table has 96 million rows and an index on each column. It contains no NULLs and has no triggers. subject_id has about 2m distinct values, cid about 200k, and indextype only six. The table is *read-only* after the initial load.

The query we want to run is (with example values; aliases t1 and t as in the plan below):

    select t.cid, count(distinct t1.subject_id)
    from termindex t1, termindex t
    where t1.cid=20642
      and t1.indextype=636
      and t.indextype=636
      and t.subject_id=t1.subject_id
    group by t.cid;

The goal is to select all subjects matching certain criteria and then, for all cids (concept ids) those subjects have, determine the distribution over the whole population for a given indextype. For instance, if the criteria are "cid=1 and indextype=636", let's say subjects 1, 2 and 3 match, and the indextype we want the distribution for is 999. We then fetch all cids where subject_id in (1,2,3) and indextype=999, and group these by cid with count(subject_id) per cid. The result would look like:

     cid | count
    -----+-------
      12 |     1
      13 |    28
     ...

Another way of asking this query is with an inner select:

    select cid, count(subject_id)
    from termindex
    where subject_id in (select subject_id from termindex
                         where cid=18869 and indextype=636)
      and indextype=636
    group by cid;

On this 96m-row table the join query takes about 50 seconds; EXPLAIN ANALYZE output is below. The inner-select query takes much longer. Pasting the explain analyze output into http://explain.depesz.com/s/Yr4 we see that Postgres is doing an external sort over about 150MB of data.

Now, we're not Postgres experts, or even great at relational design. Are there better ways of writing that query, or of designing the table?
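To make the intended semantics concrete, here is a tiny self-contained reproduction of the inner-select variant. It uses Python's built-in sqlite3 instead of Postgres, and made-up rows following the cid=1 / indextype=999 example above (so the counts are toy values, not the ones from our real table):

```python
import sqlite3

# In-memory toy version of the termindex schema (not the real 96M-row table).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE termindex (subject_id INTEGER, indextype INTEGER, cid INTEGER)"
)

rows = [
    # Criteria rows: subjects 1, 2 and 3 match cid=1, indextype=636.
    (1, 636, 1), (2, 636, 1), (3, 636, 1),
    # Distribution rows for indextype=999.
    (1, 999, 12),                               # cid 12 held by one matching subject
    (1, 999, 13), (2, 999, 13), (3, 999, 13),   # cid 13 held by all three
    (4, 999, 13),                               # subject 4 does NOT match the criteria
]
conn.executemany("INSERT INTO termindex VALUES (?, ?, ?)", rows)

# Same shape as the inner-select query in the mail, with the example values
# cid=1 (criteria) and indextype=999 (distribution).
result = conn.execute("""
    SELECT cid, COUNT(subject_id)
    FROM termindex
    WHERE subject_id IN (SELECT subject_id FROM termindex
                         WHERE cid = 1 AND indextype = 636)
      AND indextype = 999
    GROUP BY cid
    ORDER BY cid
""").fetchall()
print(result)  # [(12, 1), (13, 3)]
```

Subject 4's row for cid 13 is excluded because it fails the subquery, which is exactly the population restriction we want.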
For the latter we do have a number of constraints, though, that I don't want to outline now because this mail is already long enough.

Thanks in advance!

Thomas

                                QUERY PLAN
---------------------------------------------------------------------
 GroupAggregate  (cost=446092576.21..459395622.23 rows=200 width=8) (actual time=18927.047..19001.072 rows=2562 loops=1)
   ->  Sort  (cost=446092576.21..450526924.05 rows=1773739136 width=8) (actual time=18927.025..18952.726 rows=119429 loops=1)
         Sort Key: t.cid
         Sort Method:  external merge  Disk: 2088kB
         ->  Merge Join  (cost=1480064.68..28107420.08 rows=1773739136 width=8) (actual time=14300.547..18836.386 rows=119429 loops=1)
               Merge Cond: (t1.subject_id = t.subject_id)
               ->  Sort  (cost=44173.64..44278.93 rows=42116 width=4) (actual time=30.148..33.965 rows=14466 loops=1)
                     Sort Key: t1.subject_id
                     Sort Method:  external merge  Disk: 200kB
                     ->  Bitmap Heap Scan on mcindex t1  (cost=791.57..40361.19 rows=42116 width=4) (actual time=3.901..18.655 rows=14466 loops=1)
                           Recheck Cond: (cid = 20642)
                           ->  Bitmap Index Scan on mc2  (cost=0.00..781.04 rows=42116 width=0) (actual time=3.319..3.319 rows=14466 loops=1)
                                 Index Cond: (cid = 20642)
               ->  Materialize  (cost=1435891.04..1478006.60 rows=8423113 width=8) (actual time=14270.211..17554.299 rows=8423170 loops=1)
                     ->  Sort  (cost=1435891.04..1456948.82 rows=8423113 width=8) (actual time=14270.202..16346.835 rows=8423094 loops=1)
                           Sort Key: t.subject_id
                           Sort Method:  external merge  Disk: 148232kB
                           ->  Seq Scan on mcindex t  (cost=0.00..121502.13 rows=8423113 width=8) (actual time=0.012..1381.282 rows=8423113 loops=1)
 Total runtime: 22095.280 ms
(19 rows)

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance