Re: experiments in query optimization

Faheem Mitha <faheem@xxxxxxxxxxxxx> · Thu, 1 Apr 2010 23:45:13 +0530 (IST)

Hi Eliot,

Thanks for the comment.

On Thu, 1 Apr 2010, Eliot Gable wrote:

On Thu, Apr 1, 2010 at 7:46 AM, Faheem Mitha <faheem@xxxxxxxxxxxxx> wrote:

Looking at this more closely, idlink_id and anno_id are primary keys, so 
already have indexes on them, so my understanding (from the docs) is 
there is no purpose in creating them. That's why I removed the indexes 
that were there (back last August, actually, according to my logs). 
Anyway, doesn't look there is anything I can do here. Does anyone have 
additions or corrections to this?

When you do a join, you typically have a foreign key in one table 
referencing a primary key in another table. While designating a foreign 
key does put a constraint on the key to ensure referential integrity, it 
does not put an index on the column that is being designated as a 
foreign key. If I understand correctly, the scan done as the inner loop 
of the nested loop scan for the join is going to be your foreign key 
column, not your primary key column. Thus, if you have no index on the 
foreign key column, you will be forced to do a sequential table scan to 
do the join. In that case the hash-based join will almost certainly be 
faster (especially for such a large number of rows). If you put an index 
on the foreign key, then the inner scan can be an index scan and that 
might turn out to be faster than building the hash indexes on all the 
table rows.

Somebody can correct me if I'm wrong.

I had set the foreign keys in question (on the geno table) to be primary 
keys. This is because this setup is basically a glorified spreadsheet, and 
I don't want more than one cell corresponding to a particular tuple of 
idlink.id and anno.id (the conceptual rows and cols). Since a primary key 
defines an index, I thought putting indexes on idlink_id and anno_id was 
redundant. However, it looks like (unsurprisingly) the index corresponding 
to the primary key is across both columns, which may not be what is wanted 
for the aforesaid join. Ie.

ALTER TABLE ONLY geno ADD CONSTRAINT geno_pkey PRIMARY KEY (idlink_id, anno_id)

(As a side comment, with respect to the indexes on the other side of the 
joins, in one case, we have idlink.id = geno.idlink_id, and idlink.id is a 
primary key too. In the other, namely geno.anno_id = 
dedup_patient_anno.id, dedup_patient_anno is a CTE, so no index on 
dedup_patient_anno.id. But maybe indexes aren't needed there.)

Here is the join

   SELECT decode_genotype(geno.snpval_id, %(allelea)s, %(alleleb)s) AS g,
     geno.idlink_id, geno.anno_id
     FROM    geno
     INNER JOIN dedup_patient_anno
     ON      geno.anno_id = dedup_patient_anno.id
     INNER JOIN idlink
     ON      geno.idlink_id = idlink.id
     ORDER BY idlink_id, anno_id

Here is the table dump.

****************************************************************
-- Name: geno; Type: TABLE; Schema: hapmap; Owner: snp; Tablespace:
--
CREATE TABLE geno (
    idlink_id integer NOT NULL,
    anno_id integer NOT NULL,
    snpval_id integer NOT NULL
)
WITH (autovacuum_enabled=true);

ALTER TABLE hapmap.geno OWNER TO snp;
--
-- Name: geno_pkey; Type: CONSTRAINT; Schema: hapmap; Owner: snp; 
Tablespace:
--
ALTER TABLE ONLY geno
    ADD CONSTRAINT geno_pkey PRIMARY KEY (idlink_id, anno_id); (!!!!)
--
-- Name: geno_anno_id_fkey; Type: FK CONSTRAINT; Schema: hapmap; Owner: 
snp
--
ALTER TABLE ONLY geno
    ADD CONSTRAINT geno_anno_id_fkey FOREIGN KEY (anno_id) REFERENCES 
anno(id) ON UPDATE CASCADE ON DELETE CASCADE;
--
-- Name: geno_idlink_id_fkey; Type: FK CONSTRAINT; Schema: hapmap; Owner: 
snp
--
ALTER TABLE ONLY geno
    ADD CONSTRAINT geno_idlink_id_fkey FOREIGN KEY (idlink_id) REFERENCES 
idlink(id) ON UPDATE CASCADE ON DELETE CASCADE;
--
-- Name: geno_snpval_id_fkey; Type: FK CONSTRAINT; Schema: hapmap; Owner: 
snp
--
ALTER TABLE ONLY geno
    ADD CONSTRAINT geno_snpval_id_fkey FOREIGN KEY (snpval_id) REFERENCES 
snpval(val) ON UPDATE CASCADE ON DELETE CASCADE;
*************************************************************************

So, should I add indexes on the individual foreign key cols idlink_id
and anno_id after all?

                                                      Regards, Faheem.

--
Eliot Gable

"We do not inherit the Earth from our ancestors: we borrow it from our 
children." ~David Brower

"I decided the words were too conservative for me. We're not borrowing 
from our children, we're stealing from them--and it's not even 
considered to be a crime." ~David Brower

Nice quotes.

"Esse oportet ut vivas, non vivere ut edas." (Thou shouldst eat to live; 
not live to eat.) ~Marcus Tullius Cicero

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance