Re: How to improve speed of 3 table join &group (HUGE tables)

John Major <major@xxxxxxxxxxxxxx> · Thu, 18 Oct 2007 15:46:19 -0400

Hello Nis-

I did reset the defaults before running the explain.

Primary keys for the tables.
 sequence_fragment.seq_frag_id
 sequence.sequence_id

Candidate keys.
 fragment_external_info.seq_frag_id (FK to sequence_fragment.seq_frag_id)
 sequence_alignment.sequence_id (FK to sequence_fragment.sequence_id).

None of the fields are nullable.

sequence is the anchor table.
 seq_frag_id is the primary key (and foreign key to 
fragment_external_info) ~4.5 million unique entries
 sequence_id is an indexed field. ~3 million distinct IDs

sequence_alignment has 500million entries, but i join on sequence_id 
which has ~3million entries.

When I make the suggested changes, the new query is:
 select sf.library_id, fio.clip_type , count(sf.sequence_id)
   from    sequence_fragment sf, fragment_external_info fio
   where sf.seq_frag_id = fio.sequence_frag_id
   and  sf.sequence_id IN
       (SELECT sequence_id from sequence_alignment)
   group by sf.library_id, fio.clip_type

After  making the 2 changes, the cost dropped dramatically... but is 
still very high.
Original Explain cost:
cost=1308049564..1345206293 rows=54 width=16

New Explain cost:
cost=11831119..11831120 rows=54 width=16

John

Nis Jørgensen wrote:
John Major skrev:

I am trying to join three quite large tables, and the query is
unbearably slow(meaning I can't get results in more than a day of
processing).
I've tried the basic optimizations I understand, and nothing has
improved the execute speed.... any help with this would be greatly
appreciated

The three tables are quite large:
 sequence_fragment = 4.5 million rows
 sequence_external_info = 10million rows
 sequence_alignment = 500 million rows

The query I am attempting to run is this:

select sf.library_id, fio.clip_type , count(distinct(sa.sequence_id))
from   sequence_alignment sa,  sequence_fragment sf,
fragment_external_info fio
where sf.seq_frag_id = fio.sequence_frag_id
and  sf.sequence_id = sa.sequence_id
group by sf.library_id, fio.clip_type

NOTES:
~there are indexes on all of the fields being joined (but not on
library_id or clip_type ). ~Everything has been re-analyzed post index
creation

What are the primary (and candidate) keys of the tables? Are any of the
fields nullable? How many distinct values exist for
sequence_alignment.sequence_id?

~I've tried "set enable_seqscan=off" and set (join_table_order or
something) = 1

It would help if you turned the settings back to defaults before doing
the ANALYZE - or provide the results of that case as well.

The explain plan is as follows:

[cut]

Without trying to understand the ANALYZE output, I would suggest two
possible optimizations:

- Using count(distinct(sf.sequence_id)) instead of
count(distinct(sa.sequence_id)).

- Replacing the join to sequence_alignment with "WHERE sf.sequence_id IN
(SELECT sequence_id from sequence_alignment)".

The first one probably won't help (nor hurt), but the second one might
be able to get rid of the table scan, or at least the need do the full
merge join (which returns an estimated 3 billion rows).

Hope this helps,

Nis

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
      subscribe-nomail command to majordomo@xxxxxxxxxxxxxx so that your
      message can get through to the mailing list cleanly