I had a query that was filtering with a wildcard search of a text field for %SUCCESS%. The query took about 5 seconds and was running often so I wanted to improve it. I suggested that the engineers include a new boolean column for successful status. They implemented the requested field, but the query that filters on that new column runs very long (i kill it after letting it run for about an hour). Can someone help me understand why that is the case and how to resolve it?
First query:
SELECT *
FROM "lead"
WHERE ( NOT ( ( "lead"."id" IN
( SELECT U1."lead_id" AS "lead_id"
FROM "event" U1
WHERE U1."event_type" = 'type_1' )
OR ( "lead"."id" IN
( SELECT U1."lead_id" AS "lead_id"
FROM "event" U1
WHERE U1."event_type" = 'type_2' )
AND "lead"."id" IN
( SELECT U1."lead_id" AS "lead_id"
FROM "event" U1
WHERE UPPER(U1."response"::text) LIKE UPPER('%success%') ) ) ) )
AND NOT ("lead"."ReferenceNumber" = '') ) ;
explain/analyze result:
Seq Scan on lead (cost=130951.81..158059.21 rows=139957 width=369) (actual time=4699.619..4699.869 rows=1 loops=1)
Filter: ((NOT (hashed SubPlan 1)) AND (("ReferenceNumber")::text <> ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3))))
Rows Removed by Filter: 375369
SubPlan 1
-> Seq Scan on event u1 (cost=0.00..42408.62 rows=7748 width=4) (actual time=0.005..171.350 rows=7414 loops=1)
Filter: ((event_type)::text = 'type_1'::text)
Rows Removed by Filter: 1099436
SubPlan 2
-> Seq Scan on event u1_1 (cost=0.00..42408.62 rows=375665 width=4) (actual time=0.006..219.092 rows=373298 loops=1)
Filter: ((event_type)::text = 'type_2'::text)
Rows Removed by Filter: 733552
SubPlan 3
-> Seq Scan on event u1_2 (cost=0.00..45175.75 rows=111 width=4) (actual time=0.040..3389.550 rows=712952 loops=1)
Filter: (upper(response) ~~ '%SUCCESS%'::text)
Rows Removed by Filter: 393898
The main thing that sticks out to me for this plan is the low estimate for the rows it will return on the %SUCCESS% filter.
Here is the second query with explain:
SELECT *
FROM "lead"
WHERE
(
NOT
(
("lead"."id" IN
(
SELECT U1."lead_id" AS "lead_id"
FROM "event" U1
WHERE U1."event_type" ='type_1'
)
OR
("lead"."id" IN
(
SELECT U1."lead_id" AS "lead_id"
FROM "event" U1
WHERE U1."event_type" = 'type_2
)
AND "lead"."id" IN
(
SELECT U1."lead_id" AS "lead_id"
FROM "event" U1
WHERE successful
)
)
)
)
AND NOT ("lead"."ReferenceNumber" = '')
) ;
explain result:
Seq Scan on lead (cost=85775.78..9005687281.12 rows=139957 width=369)
Filter: ((NOT (hashed SubPlan 1)) AND (("ReferenceNumber")::text <> ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3))))
SubPlan 1
-> Seq Scan on event u1 (cost=0.00..42408.62 rows=7748 width=4)
Filter: ((event_type)::text = 'type_1'::text)
SubPlan 2
-> Seq Scan on event u1_1 (cost=0.00..42408.62 rows=375665 width=4)
Filter: ((event_type)::text = 'type_2'::text)
SubPlan 3
-> Materialize (cost=0.00..46154.43 rows=731185 width=4)
-> Seq Scan on event u1_2 (cost=0.00..39641.50 rows=731185 width=4)
Filter: successful
Here is does a materialize and estimates rows properly, but as stated this query just hangs and pegs load. There are no locks and its in an active state the whole time. I am running these queries in a test environment on a recently exported full schema from production, with a reindex and a vacuum/analyze. This is postgres 9.3.6 on rhel6.
When I run just the different subquery element:
SELECT U1."lead_id" AS "lead_id" FROM "event" U1 WHERE successful;
it returns in about 250ms, with the text field %SUCCESS% it runs in about 4 seconds. This seemed like a low hanging fruit query improvement so I'm surprised its not working, it seems like we are just lucky that the planner is estimating that filter incorrectly in the original form.
I'm sure the query just needs to be completely overhauled and am starting to pull it apart and work with the engineers to get something more efficient set up overall, but I am not sure how to answer the question as to why this original attempt at improving the query is not successful.
Any guidance is greatly appreciated, thanks!