Re: Subselect query enhancement

Richard Huxton <dev@xxxxxxxxxxxx> · Thu, 01 Feb 2007 17:23:47 +0000

Michael Artz wrote:
I'm needing help determining the best all-around query for the
following situation. 

Not sure whether such a beast exists, but...

> I have primary table that holds ip information
and two other tables that hold event data for the specific IP in with
a one-to-many mapping between them, ie:
[snip]
There is quite a bit of commonality between the network_events and
host_events schemas, but they do not currently share an ancestor.
ip_info has about 13 million rows, the network_events table has about
30 million rows, and the host_events table has about 7 million rows.
There are indexes on all the rows.

What indexes though. Do you have (name,ip) on the two event tables?

How selective is "name" - are there many different values or just a few? 
If lots, it might be worth increasing the statistics gathered on that 
column (ALTER COLUMN ... SET STATISTICS).
http://www.postgresql.org/docs/8.2/static/sql-altertable.html

The query that I would like to execute is to select all the rows of
ip_info that have either  network or host events that meet some
criteria, i.e. name='blah'.  I have 3 different possibilities that I
have thought of to execute this.

First, 2 'ip IN (SELECT ...)' statements joined by an OR:

SELECT * FROM ip_info
 WHERE ip IN (SELECT ip FROM network_events WHERE name='blah')
       OR ip IN (SELECT ip FROM host_events WHERE name='blah');

Next, 1 'ip IN (SELECT ... UNION SELECT ...) statement:

SELECT * FROM ip_info
 WHERE ip IN (SELECT ip FROM network_events WHERE name='blah'
             UNION
             SELECT ip FROM host_events WHERE name='blah');

Or, finally, the UNION statment with DISTINCTs:

SELECT * FROM ip_info
 WHERE ip IN (SELECT DISTINCT ip FROM network_events WHERE name='blah'
             UNION
             SELECT DISTINCT ip FROM host_events WHERE name='blah');

 From what I have read, the UNION statement does an implicit DISTINCT,
but I thought that doing it on each of the tables would result in
slightly faster execution.  Can you think of any other ways to
implement the previous query?

You're right about removing duplicates. Not sure whether the DISTINCTs 
on the sub-selects are helping or hindering. It'll probably depend on 
your hardware, config, number of rows etc.

The only other way I can think of for this query is to UNION two JOINs. 
Might interact well with the (name,ip) index I mentioned above.

I have explained/analyzed all the queries but, unfortunately, they are
on an isolated computer.  The gist is that, for relatively
low-incidence values of name, the UNION performs better, but for
queries on a common name, the dual-subselect query performs better.

Difficult to say much without seeing the full explain analyse. Did the 
row estimates look reasonable?

--
  Richard Huxton
  Archonet Ltd