optimizing common subqueries

Kevin Murphy <murphy@xxxxxxxxxxxxxxx> · Tue, 04 Oct 2005 15:06:00 -0400

My understanding is that PG does not eliminate common subqueries within 
complex queries.  I just wanted to confirm this.

I also tried wrapping subqueries in STABLE or IMMUTABLE functions 
returning SETOF, but from the PG 8.0.3 log, I see that the function is 
invoked redundantly anyway.  I know that disk buffering has the effect 
of optimizing this sort of thing to some extent, but I was wondering if 
I can do better.

Any other possibilities?

The code below shows what I am trying to do.  I created functions to 
make it easy for you to see where the common subqueries occur (note that 
one of the functions invokes the other function, which increases the 
redundancy.)

CREATE FUNCTION pmids_by_mention(text) RETURNS SETOF integer AS $$
 SELECT pmid
 FROM normalized_genes
 WHERE symbol IN (
   SELECT DISTINCT symbol
   FROM normalized_genes
   WHERE mention = $1
 );
$$ LANGUAGE SQL STRICT IMMUTABLE;

CREATE FUNCTION common_pmids(text,text) RETURNS SETOF integer AS $$
 SELECT * FROM pmids_by_mention($1)
 INTERSECT
 SELECT * FROM pmids_by_mention($2)
$$ LANGUAGE SQL STRICT IMMUTABLE;

SELECT PMID, COUNT(*) AS total
FROM (
 SELECT pmid FROM pmids_by_mention('mycn') as pmid
   WHERE pmid IN (
     SELECT * FROM common_pmids('mycn','trka')
   )
 UNION ALL
 SELECT pmid FROM pmids_by_mention('trka') as pmid
   WHERE pmid IN (
     SELECT * FROM common_pmids('mycn','trka')
   )
) AS subq
GROUP BY pmid
ORDER BY total desc;

I doubt anybody cares, but ... I am doing an article lookup by genes 
which are mentioned in articles (pmid), where each gene may be referred 
to indirectly via an alias (mention).  Each gene symbol has many 
aliases/mentions.  (Unfortunately, it is also possible but rare for an 
alias/mention to map to more than one gene symbol).  The query logic is 
as follows.  For each mention/alias supplied by the user, find all 
articles connected to the gene (or, rarely, genes), even indirectly.  
Take the intersection of these sets to find the articles/pmid's 
containing all the specified genes.  Unfortunately, in order to rank the 
articles by relevance, it is further necessary to come up with a result 
set containing one row for each qualifying gene mention in the set of 
matching articles.  There can be any number of search terms (up to some 
limit), so the actual query has to be built  dynamically on the fly by 
the application.

Thanks,
Kevin Murphy

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings