Hello Nis-
I did reset the defaults before running the explain.
Primary keys for the tables.
sequence_fragment.seq_frag_id
sequence.sequence_id
Candidate keys.
fragment_external_info.seq_frag_id (FK to sequence_fragment.seq_frag_id)
sequence_alignment.sequence_id (FK to sequence_fragment.sequence_id).
None of the fields are nullable.
sequence is the anchor table.
seq_frag_id is the primary key (and foreign key to
fragment_external_info) ~4.5 million unique entries
sequence_id is an indexed field. ~3 million distinct IDs
sequence_alignment has 500million entries, but i join on sequence_id
which has ~3million entries.
When I make the suggested changes, the new query is:
select sf.library_id, fio.clip_type , count(sf.sequence_id)
from sequence_fragment sf, fragment_external_info fio
where sf.seq_frag_id = fio.sequence_frag_id
and sf.sequence_id IN
(SELECT sequence_id from sequence_alignment)
group by sf.library_id, fio.clip_type
After making the 2 changes, the cost dropped dramatically... but is
still very high.
Original Explain cost:
cost=1308049564..1345206293 rows=54 width=16
New Explain cost:
cost=11831119..11831120 rows=54 width=16
John
Nis Jørgensen wrote:
John Major skrev:
I am trying to join three quite large tables, and the query is
unbearably slow(meaning I can't get results in more than a day of
processing).
I've tried the basic optimizations I understand, and nothing has
improved the execute speed.... any help with this would be greatly
appreciated
The three tables are quite large:
sequence_fragment = 4.5 million rows
sequence_external_info = 10million rows
sequence_alignment = 500 million rows
The query I am attempting to run is this:
select sf.library_id, fio.clip_type , count(distinct(sa.sequence_id))
from sequence_alignment sa, sequence_fragment sf,
fragment_external_info fio
where sf.seq_frag_id = fio.sequence_frag_id
and sf.sequence_id = sa.sequence_id
group by sf.library_id, fio.clip_type
NOTES:
~there are indexes on all of the fields being joined (but not on
library_id or clip_type ). ~Everything has been re-analyzed post index
creation
What are the primary (and candidate) keys of the tables? Are any of the
fields nullable? How many distinct values exist for
sequence_alignment.sequence_id?
~I've tried "set enable_seqscan=off" and set (join_table_order or
something) = 1
It would help if you turned the settings back to defaults before doing
the ANALYZE - or provide the results of that case as well.
The explain plan is as follows:
[cut]
Without trying to understand the ANALYZE output, I would suggest two
possible optimizations:
- Using count(distinct(sf.sequence_id)) instead of
count(distinct(sa.sequence_id)).
- Replacing the join to sequence_alignment with "WHERE sf.sequence_id IN
(SELECT sequence_id from sequence_alignment)".
The first one probably won't help (nor hurt), but the second one might
be able to get rid of the table scan, or at least the need do the full
merge join (which returns an estimated 3 billion rows).
Hope this helps,
Nis
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@xxxxxxxxxxxxxx so that your
message can get through to the mailing list cleanly