Re: Slow query with self-join, group by, 100m rows

Robert Klemme <shortcutter@xxxxxxxxxxxxxx> · Wed, 21 Sep 2011 16:51:26 +0200

On Tue, Sep 20, 2011 at 7:43 PM, Thomas Kappler <tkappler@xxxxxxxxxxxxxx> wrote:
> [please CC, I'm not on the list]
>
> Hi all,
>
> we have one table that basically uses Postgres as a key-value store.
>
>     Table "public.termindex"
> Column   |  Type   | Modifiers
> -------------+---------+-----------
>  subject_id | integer |
>  indextype  | integer |
>  cid        | integer |
>
> This is with Postgres 9.0.
>
> The table has 96 million rows and an index on each column. It contains
> no NULLs and has no triggers.
>
> subject_id has about 2m distinct values, cid about 200k, and indextype only six.
>
> The table is *read-only* after the initial load.
>
> The query we want to do is (with example values):
>
> select t.cid, count(distinct t1.subject_id)
> from termindex t1, termindex t2
> where t1.cid=20642 and t1.indextype=636 and t2.indextype=636 and
> t2.subject_id=t1.subject_id
> group by t2.cid;

Do you have any multi column indexes?  From the text of your query it
seems it could benefit from these two indexes:

(cid, indextype)
(subject_id, indextype)

I do not know whether PostgreSQL can avoid the table if you make the first index
(cid, indextype, subject_id)
in other words: append all the columns needed for the join.  In theory
the query could then be satisfied from the indexes.

> Pasting the explain analyze output into
> http://explain.depesz.com/s/Yr4 we see that Postgres is doing an
> external sort using about 150MB of data.
>
> Now, we're not Postgres experts, or even great at relational design.
> Are there better ways of doing that query, or designing the table? For
> the latter we do have a number of constraints, though, that I don't
> want to outline now because this mail is already long enough.

Those are probably important to know.

Kind regards

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance