Re: experiments in query optimization

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




Hi Eliot,

Thanks for the comment.

On Thu, 1 Apr 2010, Eliot Gable wrote:

On Thu, Apr 1, 2010 at 7:46 AM, Faheem Mitha <faheem@xxxxxxxxxxxxx> wrote:

Looking at this more closely, idlink_id and anno_id are primary keys, so already have indexes on them, so my understanding (from the docs) is there is no purpose in creating them. That's why I removed the indexes that were there (back last August, actually, according to my logs). Anyway, doesn't look there is anything I can do here. Does anyone have additions or corrections to this?

When you do a join, you typically have a foreign key in one table referencing a primary key in another table. While designating a foreign key does put a constraint on the key to ensure referential integrity, it does not put an index on the column that is being designated as a foreign key. If I understand correctly, the scan done as the inner loop of the nested loop scan for the join is going to be your foreign key column, not your primary key column. Thus, if you have no index on the foreign key column, you will be forced to do a sequential table scan to do the join. In that case the hash-based join will almost certainly be faster (especially for such a large number of rows). If you put an index on the foreign key, then the inner scan can be an index scan and that might turn out to be faster than building the hash indexes on all the table rows.

Somebody can correct me if I'm wrong.

I had set the foreign keys in question (on the geno table) to be primary keys. This is because this setup is basically a glorified spreadsheet, and I don't want more than one cell corresponding to a particular tuple of idlink.id and anno.id (the conceptual rows and cols). Since a primary key defines an index, I thought putting indexes on idlink_id and anno_id was redundant. However, it looks like (unsurprisingly) the index corresponding to the primary key is across both columns, which may not be what is wanted for the aforesaid join. Ie.

ALTER TABLE ONLY geno ADD CONSTRAINT geno_pkey PRIMARY KEY (idlink_id, anno_id)

(As a side comment, with respect to the indexes on the other side of the joins, in one case, we have idlink.id = geno.idlink_id, and idlink.id is a primary key too. In the other, namely geno.anno_id = dedup_patient_anno.id, dedup_patient_anno is a CTE, so no index on dedup_patient_anno.id. But maybe indexes aren't needed there.)

Here is the join

   SELECT decode_genotype(geno.snpval_id, %(allelea)s, %(alleleb)s) AS g,
     geno.idlink_id, geno.anno_id
     FROM    geno
     INNER JOIN dedup_patient_anno
     ON      geno.anno_id = dedup_patient_anno.id
     INNER JOIN idlink
     ON      geno.idlink_id = idlink.id
     ORDER BY idlink_id, anno_id

Here is the table dump.

****************************************************************
-- Name: geno; Type: TABLE; Schema: hapmap; Owner: snp; Tablespace:
--
CREATE TABLE geno (
    idlink_id integer NOT NULL,
    anno_id integer NOT NULL,
    snpval_id integer NOT NULL
)
WITH (autovacuum_enabled=true);

ALTER TABLE hapmap.geno OWNER TO snp;
--
-- Name: geno_pkey; Type: CONSTRAINT; Schema: hapmap; Owner: snp; Tablespace:
--
ALTER TABLE ONLY geno
    ADD CONSTRAINT geno_pkey PRIMARY KEY (idlink_id, anno_id); (!!!!)
--
-- Name: geno_anno_id_fkey; Type: FK CONSTRAINT; Schema: hapmap; Owner: snp
--
ALTER TABLE ONLY geno
ADD CONSTRAINT geno_anno_id_fkey FOREIGN KEY (anno_id) REFERENCES anno(id) ON UPDATE CASCADE ON DELETE CASCADE;
--
-- Name: geno_idlink_id_fkey; Type: FK CONSTRAINT; Schema: hapmap; Owner: snp
--
ALTER TABLE ONLY geno
ADD CONSTRAINT geno_idlink_id_fkey FOREIGN KEY (idlink_id) REFERENCES idlink(id) ON UPDATE CASCADE ON DELETE CASCADE;
--
-- Name: geno_snpval_id_fkey; Type: FK CONSTRAINT; Schema: hapmap; Owner: snp
--
ALTER TABLE ONLY geno
ADD CONSTRAINT geno_snpval_id_fkey FOREIGN KEY (snpval_id) REFERENCES snpval(val) ON UPDATE CASCADE ON DELETE CASCADE;
*************************************************************************

So, should I add indexes on the individual foreign key cols idlink_id
and anno_id after all?

                                                      Regards, Faheem.

--
Eliot Gable

"We do not inherit the Earth from our ancestors: we borrow it from our children." ~David Brower

"I decided the words were too conservative for me. We're not borrowing from our children, we're stealing from them--and it's not even considered to be a crime." ~David Brower

Nice quotes.

"Esse oportet ut vivas, non vivere ut edas." (Thou shouldst eat to live; not live to eat.) ~Marcus Tullius Cicero


--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux