Hi Eliot,
Thanks for the comment.
On Thu, 1 Apr 2010, Eliot Gable wrote:
On Thu, Apr 1, 2010 at 7:46 AM, Faheem Mitha <faheem@xxxxxxxxxxxxx> wrote:
Looking at this more closely, idlink_id and anno_id are primary keys, so
already have indexes on them, so my understanding (from the docs) is
there is no purpose in creating them. That's why I removed the indexes
that were there (back last August, actually, according to my logs).
Anyway, doesn't look there is anything I can do here. Does anyone have
additions or corrections to this?
When you do a join, you typically have a foreign key in one table
referencing a primary key in another table. While designating a foreign
key does put a constraint on the key to ensure referential integrity, it
does not put an index on the column that is being designated as a
foreign key. If I understand correctly, the scan done as the inner loop
of the nested loop scan for the join is going to be your foreign key
column, not your primary key column. Thus, if you have no index on the
foreign key column, you will be forced to do a sequential table scan to
do the join. In that case the hash-based join will almost certainly be
faster (especially for such a large number of rows). If you put an index
on the foreign key, then the inner scan can be an index scan and that
might turn out to be faster than building the hash indexes on all the
table rows.
Somebody can correct me if I'm wrong.
I had set the foreign keys in question (on the geno table) to be primary
keys. This is because this setup is basically a glorified spreadsheet, and
I don't want more than one cell corresponding to a particular tuple of
idlink.id and anno.id (the conceptual rows and cols). Since a primary key
defines an index, I thought putting indexes on idlink_id and anno_id was
redundant. However, it looks like (unsurprisingly) the index corresponding
to the primary key is across both columns, which may not be what is wanted
for the aforesaid join. Ie.
ALTER TABLE ONLY geno ADD CONSTRAINT geno_pkey PRIMARY KEY (idlink_id, anno_id)
(As a side comment, with respect to the indexes on the other side of the
joins, in one case, we have idlink.id = geno.idlink_id, and idlink.id is a
primary key too. In the other, namely geno.anno_id =
dedup_patient_anno.id, dedup_patient_anno is a CTE, so no index on
dedup_patient_anno.id. But maybe indexes aren't needed there.)
Here is the join
SELECT decode_genotype(geno.snpval_id, %(allelea)s, %(alleleb)s) AS g,
geno.idlink_id, geno.anno_id
FROM geno
INNER JOIN dedup_patient_anno
ON geno.anno_id = dedup_patient_anno.id
INNER JOIN idlink
ON geno.idlink_id = idlink.id
ORDER BY idlink_id, anno_id
Here is the table dump.
****************************************************************
-- Name: geno; Type: TABLE; Schema: hapmap; Owner: snp; Tablespace:
--
CREATE TABLE geno (
idlink_id integer NOT NULL,
anno_id integer NOT NULL,
snpval_id integer NOT NULL
)
WITH (autovacuum_enabled=true);
ALTER TABLE hapmap.geno OWNER TO snp;
--
-- Name: geno_pkey; Type: CONSTRAINT; Schema: hapmap; Owner: snp;
Tablespace:
--
ALTER TABLE ONLY geno
ADD CONSTRAINT geno_pkey PRIMARY KEY (idlink_id, anno_id); (!!!!)
--
-- Name: geno_anno_id_fkey; Type: FK CONSTRAINT; Schema: hapmap; Owner:
snp
--
ALTER TABLE ONLY geno
ADD CONSTRAINT geno_anno_id_fkey FOREIGN KEY (anno_id) REFERENCES
anno(id) ON UPDATE CASCADE ON DELETE CASCADE;
--
-- Name: geno_idlink_id_fkey; Type: FK CONSTRAINT; Schema: hapmap; Owner:
snp
--
ALTER TABLE ONLY geno
ADD CONSTRAINT geno_idlink_id_fkey FOREIGN KEY (idlink_id) REFERENCES
idlink(id) ON UPDATE CASCADE ON DELETE CASCADE;
--
-- Name: geno_snpval_id_fkey; Type: FK CONSTRAINT; Schema: hapmap; Owner:
snp
--
ALTER TABLE ONLY geno
ADD CONSTRAINT geno_snpval_id_fkey FOREIGN KEY (snpval_id) REFERENCES
snpval(val) ON UPDATE CASCADE ON DELETE CASCADE;
*************************************************************************
So, should I add indexes on the individual foreign key cols idlink_id
and anno_id after all?
Regards, Faheem.
--
Eliot Gable
"We do not inherit the Earth from our ancestors: we borrow it from our
children." ~David Brower
"I decided the words were too conservative for me. We're not borrowing
from our children, we're stealing from them--and it's not even
considered to be a crime." ~David Brower
Nice quotes.
"Esse oportet ut vivas, non vivere ut edas." (Thou shouldst eat to live;
not live to eat.) ~Marcus Tullius Cicero
--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance