Re: Subselect query enhancement

Richard Huxton <dev@xxxxxxxxxxxx> · Fri, 02 Feb 2007 09:51:42 +0000

Michael Artz wrote:
Here are some numbers for 3 different queries using a very selective
query (port = 2222).  I'm thinking that, since the row estimates are
different from the actuals (2 vs 2000), that this particular port
didn't make it into the statistics ... is that true?  Does this
matter?  If this isn't enough data, I can type up some more.

SELECT * FROM pg_stats WHERE tablename='foo';
This will show you frequency-stats on each column (as generated by 
analyse). You're interested in n_distinct, most_common_vals, 
most_common_freqs.

One thing that I forgot to mention is that the network_events and
host_events tables are partitioned by first octet of the IP, so when I
say "various scans of ..." that means that there is a scan of each of
the partitions, the type determined by the planner and the statistics,
I assume.

So you've got xxx_events tables partitioned by ip, but ip_info is one 
table? Do you do a lot of scans across the bottom 3 bytes of the IP? If 
not, I'm not clear what we're gaining from the partitioning.

** Dual subselects:
SELECT * FROM ip_info
WHERE ip IN (SELECT ip FROM network_events WHERE port = 2222)
      OR ip IN (SELECT ip FROM host_events WHERE port = 2222);

Seq scan on ip_info (cost=2776..354575 rows=9312338 width=72) (actual
time=34..8238 rows=234 loops=1)
Filter:  ... AND ((hashed_subplan) OR (hashed_subplan))

Well, the estimate here is rubbish - 9.3 million rows whereas we 
actually get 234. Now we know you're likely to get a lot of overlap, and 
the planner might not realise that. Still - that looks very bad. Of 
course, because it's expecting so many rows a seq-scan of ip_info looks 
like a good choice to it.

** Single subselect:

SELECT * FROM ip_info
WHERE ip IN (SELECT DISTINCT ip FROM network_events WHERE port = 2222
            UNION
            SELECT DISTINCT ip FROM host_events WHERE port = 2222);

Nested Loop (cost=2841..2856 rows=2 width=72) (actual time=55..106
rows=2349 loops=1)

This is clearly a lot better, Not sure whether the DISTINCT in each 
subquery works or not.

** The join:

SELECT distinct ip_info.*
 FROM ip_info RIGHT JOIN network_events USING (ip)
      RIGHT JOIN  host_events USING (ip)
 WHERE (network_events.port=2222 OR host_events.port=2222)

Unique (cost=9238..9367 rows=1965 width=72) (actual time=61..61 rows=52 
loops=1)
 -> Sort (cost=9238..9288 rows=1965 width=72) (actual time=61..61
rows=63 loops=1)

OK, so what do the plans look like for port=80 or something larger like 
that?

Then try adding an index to the various host/network_events tables
CREATE INDEX ... ON ... (ip) WHERE port=80;

--
  Richard Huxton
  Archonet Ltd