Re:

Brad DeJong <Brad.Dejong@xxxxxxxxx> · Wed, 28 Jun 2017 13:10:13 +0000

On 2017-06-28, Pavel Stehule wrote ...
> On 2017-06-28, Yevhenii Kurtov wrote ...
>> On 2017-06-28, Pavel Stehule wrote ...
>>> On 2017-06-28, Yevhenii Kurtov wrote ...
>>>> We have a query that is run almost each second and it's very important to squeeze every other ms out of it. The query is:
>>>> ...
>>>> I added following index: CREATE INDEX ON campaign_jobs(id, status, failed_at, started_at, priority DESC, times_failed);
>>>> ...
>>> There are few issues 
>>> a) parametrized LIMIT
>>> b) complex predicate with lot of OR 
>>> c) slow external sort
>>>
>>> b) signalize maybe some strange in design .. try to replace "OR" by "UNION" query
>>> c) if you can and you have good enough memory .. try to increase work_mem .. maybe 20MB
>>>
>>> if you change query to union queries, then you can use conditional indexes
>>>
>>> create index(id) where status = 0;
>>> create index(failed_at) where status = 2;
>>> create index(started_at) where status = 1;
>>
>> Can you please give a tip how to rewrite the query with UNION clause?
>
> SELECT c0."id" FROM "campaign_jobs" AS c0
> WHERE (((c0."status" = $1) AND NOT (c0."id" = ANY($2))))
> UNION SELECT c0."id" FROM "campaign_jobs" AS c0
> WHERE ((c0."status" = $3) AND (c0."failed_at" > $4))
> UNION SELECT c0."id" FROM "campaign_jobs" AS c0
> WHERE ((c0."status" = $5) AND (c0."started_at" < $6))
> ORDER BY c0."priority" DESC, c0."times_failed"
> LIMIT $7
> FOR UPDATE SKIP LOCKED

Normally (at least for developers I've worked with), that kind of query structure is used when the "status" values don't overlap and don't change from query to query. Judging from Pavel's suggested conditional indexes (i.e. "where status = <constant>"), he also thinks that is likely.

Give the optimizer that information so that it can use it. Assuming $1 = 0 and $3 = 2 and $5 = 1, substitute literals. Substitute literal for $7 in limit. Push order by and limit to each branch of the union all (or does Postgres figure that out automatically?) Replace union with union all (not sure about Postgres, but allows other dbms to avoid sorting and merging result sets to eliminate duplicates). (Use of UNION ALL assumes that "id" is unique across rows as implied by only "id" being selected with FOR UPDATE. If multiple rows can have the same "id", then use UNION to eliminate the duplicates.)

SELECT "id" FROM "campaign_jobs" WHERE "status" = 0 AND NOT "id" = ANY($1)
  UNION ALL
SELECT "id" FROM "campaign_jobs" WHERE "status" = 2 AND "failed_at" > $2
  UNION ALL
SELECT "id" FROM "campaign_jobs" WHERE "status" = 1 AND "started_at" < $3
ORDER BY "priority" DESC, "times_failed"
LIMIT 100
FOR UPDATE SKIP LOCKED

Another thing that you could try is to push the ORDER BY and LIMIT to the branches of the UNION (or does Postgres figure that out automatically?) and use slightly different indexes. This may not make sense for all the branches but one nice thing about UNION is that each branch can be tweaked independently. Also, there are probably unmentioned functional dependencies that you can use to reduce the index size and/or improve your match rate. Example - if status = 1 means that the campaign_job has started but not failed or completed, then you may know that started_at is set, but failed_at and ended_at are null. The < comparison in and of itself implies that only rows where "started_at" is not null will match the condition.

SELECT c0."id" FROM "campaign_jobs" AS c0 WHERE (((c0."status" = 0) AND NOT (c0."id" = ANY($1)))) ORDER BY c0."priority" DESC, c0."times_failed" LIMIT 100
UNION ALL
SELECT c0."id" FROM "campaign_jobs" AS c0 WHERE ((c0."status" = 2) AND (c0."failed_at" > $2)) ORDER BY c0."priority" DESC, c0."times_failed" LIMIT 100
UNION ALL
SELECT c0."id" FROM "campaign_jobs" AS c0 WHERE ((c0."status" = 1) AND (c0."started_at" < $3)) ORDER BY c0."priority" DESC, c0."times_failed" LIMIT 100
ORDER BY c0."priority" DESC, c0."times_failed"
LIMIT 100
FOR UPDATE SKIP LOCKED

Including the "priority", "times_failed" and "id" columns in the indexes along with "failed_at"/"started_at" allows the optimizer to do index only scans. (May still have to do random I/O to the data page to determine tuple version visibility but I don't think that can be eliminated.)

create index ... ("priority" desc, "times_failed", "id")               where "status" = 0;
create index ... ("priority" desc, "times_failed", "id", "failed_at")  where "status" = 2 and "failed_at" is not null;
create index ... ("priority" desc, "times_failed", "id", "started_at") where "status" = 1 and "started_at" is not null; -- and ended_at is null and ...

I'm assuming that the optimizer knows that "where status = 1 and started_at < $3" implies "and started_at is not null" and will consider the conditional index. If not, then the "and started_at is not null" needs to be explicit.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance