Apparent missed query optimization with self-join and inner grouping

Zack Weinberg <zackw@xxxxxxxxx> · Fri, 31 Jul 2020 11:07:39 -0400

I have a table recording the results of a web crawl.  (Table
definition at the end of this message.)  The relevant part of the data
stored in it looks like this:

  id  | url_id | full_url_id | experiment_id | redirect_num
------+--------+-------------+---------------+--------------
 2617 |   1312 |        1312 |            16 |            0
 2618 |   1312 |        2311 |            16 |            1
 2619 |   1312 |        2312 |            16 |            2
 2620 |   1312 |        2313 |            16 |            3
 2631 |   1320 |        1320 |            43 |            0
 2633 |   1320 |        2312 |            43 |            2
 2632 |   1320 |        2317 |            43 |            1
 2634 |   1320 |        2318 |            43 |            3

For each (experiment_id, url_id) pair for some small subset of the
experiment_ids, I need to query the full_url_id corresponding to the
*largest* value of redirect_num.  The query planner does something
reasonable with this SELECT:

=> explain (analyze, verbose)
   select b.experiment_id, b.url_id, b.full_url_id
     from blockpage b,
          (select experiment_id, url_id, max(redirect_num) as redirect_num
             from blockpage group by experiment_id, url_id) bm
    where b.experiment_id = bm.experiment_id
      and b.url_id = bm.url_id
      and b.redirect_num = bm.redirect_num
      and bm.experiment_id in (16, 43);

 Nested Loop  (cost=1.14..88505.96 rows=20 width=12) (actual
time=0.041..1.723 rows=501 loops=1)
   Output: b.experiment_id, b.url_id, b.full_url_id
   ->  GroupAggregate  (cost=0.57..15442.73 rows=8543 width=12)
(actual time=0.033..0.501 rows=501 loops=1)
         Output: blockpage.experiment_id, blockpage.url_id,
max(blockpage.redirect_num)
         Group Key: blockpage.experiment_id, blockpage.url_id
         ->  Index Only Scan using
blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on
iclab.blockpage  (cost=0.57..15293.19 rows=8547 width=12) (actual
time=0.026..0.283 rows=803 loops=1)
               Output: blockpage.experiment_id, blockpage.url_id,
blockpage.full_url_id, blockpage.redirect_num, blockpage.html_tag_id
               Index Cond: (blockpage.experiment_id = ANY
('{16,43}'::integer[]))
               Heap Fetches: 803
   ->  Index Only Scan using
blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on
iclab.blockpage b  (cost=0.57..8.53 rows=1 width=16) (actual
time=0.002..0.002 rows=1 loops=501)
         Output: b.experiment_id, b.url_id, b.full_url_id,
b.redirect_num, b.html_tag_id
         Index Cond: ((b.experiment_id = blockpage.experiment_id) AND
(b.url_id = blockpage.url_id) AND (b.redirect_num =
(max(blockpage.redirect_num))))
         Heap Fetches: 501
 Planning Time: 0.331 ms
 Execution Time: 1.784 ms

But if I change the final part of the WHERE to reference
b.experiment_id instead of bm.experiment_id, I get this much more
expensive query plan:

=> explain (analyze, verbose)
   select b.experiment_id, b.url_id, b.full_url_id
     from blockpage b,
          (select experiment_id, url_id, max(redirect_num) as redirect_num
             from blockpage group by experiment_id, url_id) bm
    where b.experiment_id = bm.experiment_id
      and b.url_id = bm.url_id
      and b.redirect_num = bm.redirect_num
      and b.experiment_id in (16, 43);

 Hash Join  (cost=2749504.19..2764864.13 rows=2 width=12) (actual
time=144028.343..144029.545 rows=501 loops=1)
   Output: b.experiment_id, b.url_id, b.full_url_id
   Inner Unique: true
   Hash Cond: ((b.experiment_id = blockpage.experiment_id) AND
(b.url_id = blockpage.url_id) AND (b.redirect_num =
(max(blockpage.redirect_num))))
   ->  Index Only Scan using
blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on
iclab.blockpage b  (cost=0.57..15293.19 rows=8547 width=16) (actual
time=0.039..0.387 rows=803 loops=1)
         Output: b.experiment_id, b.url_id, b.full_url_id,
b.redirect_num, b.html_tag_id
         Index Cond: (b.experiment_id = ANY ('{16,43}'::integer[]))
         Heap Fetches: 803
   ->  Hash  (cost=2595219.62..2595219.62 rows=8816229 width=12)
(actual time=143941.931..143941.931 rows=57061228 loops=1)
         Output: blockpage.experiment_id, blockpage.url_id,
(max(blockpage.redirect_num))
         Buckets: 67108864 (originally 16777216)  Batches: 1
(originally 1)  Memory Usage: 2976138kB
         ->  HashAggregate  (cost=2418895.04..2507057.33 rows=8816229
width=12) (actual time=90020.851..122656.924 rows=57061228 loops=1)
               Output: blockpage.experiment_id, blockpage.url_id,
max(blockpage.redirect_num)
               Group Key: blockpage.experiment_id, blockpage.url_id
               ->  Seq Scan on iclab.blockpage  (cost=0.00..1757677.88
rows=88162288 width=12) (actual time=0.020..32910.709 rows=88164599
loops=1)
                     Output: blockpage.id, blockpage.url_id,
blockpage.full_url_id, blockpage.experiment_id,
blockpage.blockpage_reason_id, blockpage.html_tag_id,
blockpage.body_len, blockpage.blockpage_diff,
blockpage.has_blockpage_regex, blockpage.http_status,
blockpage.redirect_num, blockpage.response_failure
 Planning Time: 0.409 ms
 JIT:
   Functions: 17
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 2.098 ms, Inlining 27.226 ms, Optimization
140.416 ms, Emission 79.840 ms, Total 249.580 ms
 Execution Time: 145108.471 ms

Since we have `b.experiment_id = bm.experiment_id` as a join
condition, it seems to me that the query planner ought to have
recognized that the `experiment_id in (16, 43)` condition could be
pushed down into the sub-select in both cases.

What is the best way to report this to the developers?  Should I file
a bug report?  I'm using Postgres 12.2.

Thanks,
zw

CREATE TABLE iclab.blockpage (
    id                          BIGSERIAL   PRIMARY KEY,
    url_id                      INTEGER     NOT NULL,
    full_url_id                 INTEGER     NOT NULL,
    experiment_id               INTEGER     NOT NULL,
    blockpage_reason_id         INTEGER,
    html_tag_id                 INTEGER     NOT NULL,
    body_len                    INTEGER,
    blockpage_diff              REAL,
    has_blockpage_regex         BOOLEAN,
    http_status                 INTEGER,
    redirect_num                INTEGER     NOT NULL,
    response_failure            INTEGER,
    FOREIGN KEY (experiment_id)         REFERENCES iclab.experiment_results(id),
    FOREIGN KEY(url_id)                 REFERENCES iclab.URL(id),
    FOREIGN KEY(url_id)                 REFERENCES iclab.URL(id),
    FOREIGN KEY(blockpage_reason_id)    REFERENCES iclab.blockpage_reason(id),
    FOREIGN KEY(html_tag_id)            REFERENCES iclab.html_tag(id)
);

create unique index
blockpage__experiment_id_url_id_redirect_num_blockpage_reason
    on iclab.blockpage(experiment_id , url_id, full_url_id,
redirect_num, html_tag_id);