Re: Subselect query enhancement

"Michael Artz" <mlartz@xxxxxxxxx> · Thu, 1 Feb 2007 13:27:32 -0500

 > I have primary table that holds ip information
> and two other tables that hold event data for the specific IP in with
> a one-to-many mapping between them, ie:
[snip]
> There is quite a bit of commonality between the network_events and
> host_events schemas, but they do not currently share an ancestor.
> ip_info has about 13 million rows, the network_events table has about
> 30 million rows, and the host_events table has about 7 million rows.
> There are indexes on all the rows.

What indexes though. Do you have (name,ip) on the two event tables?

All the columns are indexed individually.  The tables are completely
static, as I reload the whole DB with new data every day.

How selective is "name" - are there many different values or just a few?
If lots, it might be worth increasing the statistics gathered on that
column (ALTER COLUMN ... SET STATISTICS).
http://www.postgresql.org/docs/8.2/static/sql-altertable.html

I guess that is the heart of my question.  "name" is not very
selective (there are only 20 or so choices) however other columns are
fairly selective for certain cases, such as 'port'.  When querying on
and unusual port, the query is very fast, and the single UNIONed
subselect returns quickly.  When 'port' is not very selective (like
port = '80', which is roughly 1/2 of the rows in the DB), the dual
subselect query wins, hands-down.

And I have altered the statistics via the config file:
 default_statistics_target = 100
Perhaps this should be even higher for certain columns?

> The query that I would like to execute is to select all the rows of
> ip_info that have either  network or host events that meet some
> criteria, i.e. name='blah'.  I have 3 different possibilities that I
> have thought of to execute this.
>
> First, 2 'ip IN (SELECT ...)' statements joined by an OR:
>
> SELECT * FROM ip_info
>  WHERE ip IN (SELECT ip FROM network_events WHERE name='blah')
>        OR ip IN (SELECT ip FROM host_events WHERE name='blah');
>
> Next, 1 'ip IN (SELECT ... UNION SELECT ...) statement:
>
> SELECT * FROM ip_info
>  WHERE ip IN (SELECT ip FROM network_events WHERE name='blah'
>              UNION
>              SELECT ip FROM host_events WHERE name='blah');
>
> Or, finally, the UNION statment with DISTINCTs:
>
> SELECT * FROM ip_info
>  WHERE ip IN (SELECT DISTINCT ip FROM network_events WHERE name='blah'
>              UNION
>              SELECT DISTINCT ip FROM host_events WHERE name='blah');
>
>  From what I have read, the UNION statement does an implicit DISTINCT,
> but I thought that doing it on each of the tables would result in
> slightly faster execution.  Can you think of any other ways to
> implement the previous query?

You're right about removing duplicates. Not sure whether the DISTINCTs
on the sub-selects are helping or hindering. It'll probably depend on
your hardware, config, number of rows etc.

The only other way I can think of for this query is to UNION two JOINs.
Might interact well with the (name,ip) index I mentioned above.

Nah, that did very poorly.

> I have explained/analyzed all the queries but, unfortunately, they are
> on an isolated computer.  The gist is that, for relatively
> low-incidence values of name, the UNION performs better, but for
> queries on a common name, the dual-subselect query performs better.

Difficult to say much without seeing the full explain analyse. Did the
row estimates look reasonable?

hmm, I think so, but I'm not that good in reading the outputs.  I'll
see if I can retype some of the interesting bits of the explain
analyze.