Re: Planner sometimes doesn't use a relevant index with IN (subquery) condition

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



This indeed works around the issue. Thanks!

On Mon, Nov 12, 2012 at 9:53 AM, ashutosh durugkar <ashucould@xxxxxxxxx> wrote:
> Hey Rafal,
>
>
>>SELECT * FROM (SELECT run_id, utilization FROM stats) AS s WHERE
> run_id IN (SELECT run_id FROM runs WHERE server_id = 515);
>
> could you try this:
>
>
> SELECT * FROM (SELECT run_id, utilization FROM stats) AS s WHERE
> run_id = ANY(ARRAY(SELECT run_id FROM runs WHERE server_id = 515));
>
> Thanks,
>
> On Sun, Nov 11, 2012 at 8:48 AM, Rafał Rzepecki <divided.mind@xxxxxxxxx>
> wrote:
>>
>> [Please CC me on replies, as I'm not subscribed; thank you.]
>>
>> I've ran into a problem with the query planner and IN (subquery)
>> conditions which I suspect to be a bug. I'll attempt to describe the
>> relevant details of my database and explain which behaviour I find
>> unexpected. I've also tried to trigger this behaviour in a clean
>> database; I think I've succeeded, but the conditions are a bit
>> different, so perhaps it's a different problem. I'll describe this
>> setup in detail below.
>>
>> I have a somewhat large table (~2.5M rows), stats, which is quite
>> often (several records a minute) INSERTed to, but never UPDATEd or
>> DELETEd from. (In case it's relevant, it has an attached AFTER INSERT
>> trigger which checks time and rebuilds an aggregate materialized view
>> every hour.) This is the schema:
>> # \d+ stats
>>                                                   Table
>> "serverwatch.stats"
>>       Column      |            Type             |
>> Modifiers                      | Storage | Description
>>
>> ------------------+-----------------------------+----------------------------------------------------+---------+-------------
>>  id               | integer                     | not null default
>> nextval('stats_id_seq'::regclass) | plain   |
>>  run_id           | integer                     | not null
>>                               | plain   |
>>  start_time       | timestamp without time zone | not null
>>                               | plain   |
>>  end_time         | timestamp without time zone | not null
>>                               | plain   |
>>  cpu_utilization  | double precision            |
>>                               | plain   |
>>  disk_read_ops    | bigint                      |
>>                               | plain   |
>>  disk_write_ops   | bigint                      |
>>                               | plain   |
>>  network_out      | bigint                      |
>>                               | plain   |
>>  network_in       | bigint                      |
>>                               | plain   |
>>  disk_read_bytes  | bigint                      |
>>                               | plain   |
>>  disk_write_bytes | bigint                      |
>>                               | plain   |
>> Indexes:
>>     "stats_pkey" PRIMARY KEY, btree (id)
>>     "stats_day_index" btree (run_id, day(stats.*))
>>     "stats_month_index" btree (run_id, month(stats.*))
>>     "stats_week_index" btree (run_id, week(stats.*))
>> Foreign-key constraints:
>>     "stats_runs" FOREIGN KEY (run_id) REFERENCES runs(id)
>> Triggers:
>>     stats_day_refresh_trigger AFTER INSERT OR UPDATE ON stats FOR EACH
>> STATEMENT EXECUTE PROCEDURE mat_view_refresh('serverwatch.stats_day')
>> Has OIDs: no
>>
>> day(), month() and week() functions are just trivial date_trunc on a
>> relevant field. The referenced table looks like this:
>> # \d+ runs
>>                                                  Table "serverwatch.runs"
>>      Column      |            Type             |
>> Modifiers                     | Storage | Description
>>
>> -----------------+-----------------------------+---------------------------------------------------+---------+-------------
>>  id              | integer                     | not null default
>> nextval('runs_id_seq'::regclass) | plain   |
>>  server_id       | integer                     | not null
>>                             | plain   |
>>  flavor          | flavor                      | not null
>>                             | plain   |
>>  region          | region                      | not null
>>                             | plain   |
>>  launch_time     | timestamp without time zone | not null
>>                             | plain   |
>>  stop_time       | timestamp without time zone |
>>                             | plain   |
>>  project_info_id | integer                     | not null
>>                             | plain   |
>>  owner_info_id   | integer                     | not null
>>                             | plain   |
>> Indexes:
>>     "runs_pkey" PRIMARY KEY, btree (id)
>>     "index_runs_on_flavor" btree (flavor)
>>     "index_runs_on_owner_info_id" btree (owner_info_id)
>>     "index_runs_on_project_info_id" btree (project_info_id)
>>     "index_runs_on_region" btree (region)
>>     "index_runs_on_server_id" btree (server_id)
>> Foreign-key constraints:
>>     "runs_owner_info_id_fkey" FOREIGN KEY (owner_info_id) REFERENCES
>> user_infos(id)
>>     "runs_project_info_id_fkey" FOREIGN KEY (project_info_id)
>> REFERENCES project_infos(id)
>> Referenced by:
>>     TABLE "stats_day" CONSTRAINT "stats_day_runs" FOREIGN KEY (run_id)
>> REFERENCES runs(id)
>>     TABLE "stats" CONSTRAINT "stats_runs" FOREIGN KEY (run_id)
>> REFERENCES runs(id)
>> Has OIDs: no
>>
>> Now consider this query - note I'm using a subselect here because the
>> problem originally manifested itself with a view:
>> SELECT * FROM (SELECT run_id, disk_write_ops FROM stats) AS s WHERE
>> run_id IN (SELECT id FROM runs WHERE server_id = 515);
>>
>> As might be expected, the planner chooses to use one of the three
>> indices with run_id:
>> http://explain.depesz.com/s/XU3Q
>>
>> Now consider a similar query, but with aggregation:
>> SELECT * FROM (SELECT run_id, SUM(disk_write_ops) FROM stats GROUP BY
>> run_id) AS s WHERE run_id IN (SELECT id FROM runs WHERE server_id =
>> 515);
>>
>> Now the picture is very different. The planner, unexplicably,
>> dismisses the index and opts instead to do a full scan on stats, the
>> table 2.5 million rows big.
>> http://explain.depesz.com/s/Rqt
>>
>> Note that the problem disappears when we replace the IN condition with
>> literal:
>> SELECT * FROM (SELECT run_id, SUM(disk_write_ops) FROM stats GROUP BY
>> run_id) AS s WHERE run_id IN (1815, 1816);
>>
>> The ids are the result of the inner select ran separately, so the
>> query has the exact same result; it's worth pointing out that the
>> planner has a correct estimate on the selectivity of the condition -
>> exactly two rows from runs are selected, as expected. But when literal
>> is used the planner correctly chooses to use the index:
>> http://explain.depesz.com/s/lYc
>>
>> Similarly a correct plan is chosen when we unnest the inner SELECT:
>> SELECT run_id, SUM(disk_write_ops) FROM stats WHERE run_id IN (SELECT
>> id FROM runs WHERE server_id = 515) GROUP BY run_id;
>> http://explain.depesz.com/s/dlwZ
>>
>>
>> I've tried to replicate this on a clean database:
>> CREATE TABLE runs(run_id serial PRIMARY KEY, server_id INTEGER NOT NULL);
>> CREATE INDEX runs_server ON runs(server_id);
>> CREATE TABLE stats(entry_id serial PRIMARY KEY, run_id integer
>> REFERENCES runs NOT NULL, utilization INTEGER NOT NULL);
>> CREATE INDEX stats_runs ON stats(run_id);
>>
>> Now let's try some queries:
>> SELECT * FROM (SELECT run_id, utilization FROM stats) AS s WHERE
>> run_id IN (1212, 2323, 121, 561, 21, 561, 125, 2, 55, 52, 42);
>> http://explain.depesz.com/s/Kcb - fine, index used
>>
>> SELECT * FROM (SELECT run_id, utilization FROM stats) AS s WHERE
>> run_id IN (SELECT run_id FROM runs WHERE server_id = 515);
>> http://explain.depesz.com/s/QFs - seqscan!
>> Obviously it doesn't mean much, as the tables are empty and there are
>> no stats, but still a radically different plan is chosen for what is
>> essentially the same query.
>>
>> Note that in this case the behaviour is the same even when unnested:
>> SELECT run_id, utilization FROM stats WHERE run_id IN (SELECT run_id
>> FROM runs WHERE server_id = 515);
>> http://explain.depesz.com/s/y3GM
>>
>> So, is this a bug in the planner, or am I somehow subtly changing the
>> semantics of the query and don't notice?
>> I understand the planner perhaps tries to parallelize queries when a
>> SELECT is used in the IN clause, but given the stats it doesn't seem
>> to make much sense.
>>
>> Thanks, and let me know if you want me to test something on my
>> database over here or if there's some relevant info I've ommited.
>>
>> (PostgreSQL 9.1.6 on x86_64-unknown-linux-gnu, compiled by gcc
>> (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2, 64-bit
>> running on Ubuntu 12.10, ubuntu package
>> postgresql-9.1-9.1.6-1ubuntu1:amd64, default configuration)
>> --
>> Rafał Rzepecki
>>
>>
>> --
>> Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-performance
>
>



-- 
Rafał Rzepecki


-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance



[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux