I have a complex query which essentially runs a finite state
automaton through a WITH RECURSIVE union, adding the next state
based on the previous one. This is run for 100,000 up to a million
start states at the same time, each step picking the next record
(token) and matching it to the FSA (a three-way join:
token inner join next token inner join state-transition table -> next state).
I know this doesn't really tell you much. The following might
give you a glimpse:
with recursive Token as (
  select *
    from steps
    left outer join token using (event)
   limit 100000
), StartStates as (
  select pathId, start, end, m.new_state as state, m.goalId
    from Token w
   inner join FSA m on (m.token = w.token and m.old_state = w.state)
), Phrase as (
  select pathId, start, end, state, goalId
    from StartStates
  union all
  select p.pathId, p.start, n.end, n.new_state as state, n.goalId
    from Phrase p
   inner join (
         select pathId, start, end, old_state as state, new_state, f.goalId
           from Token
          inner join FSA f using (token)
        ) n using (pathId, end, state)
There are hundreds of thousands of states. This join has a HUGE
fan-out if it is not immediately limited by the chaining criterion
on the old_state. So any attempt to use a merge join or hash join,
which computes the whole big thing and only then applies the
chaining criterion, just creates massive amounts of sort load
and/or humongous hash tables, only to throw the vast majority away
every time. But when it runs through nested loops, the indexes
help to make it really quick.
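Roughly, the index the nested loop probes looks like this (a
sketch only, inferred from the plan below; the exact column list
of fsa_old_state_token_idx is an assumption):

-- Sketch, inferred from the plan: the recursive step probes this
-- index once per (old_state, token) pair, so only the matching
-- transitions are ever fetched instead of the whole FSA relation.
create index fsa_old_state_token_idx on fsa (old_state, token);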
I cannot show you the exact data, but I can show you the plan
that works amazingly fast:
Insert on good_paths  (cost=912224.51..912228.71 rows=240 width=302)
  CTE token
    ->  Limit  (cost=46728.24..81127.75 rows=100000 width=519)
          ->  Hash Left Join  (cost=46728.24..115752.23 rows=200654 width=519)
                ... this is creating the start states
  CTE path
    ->  Recursive Union  (cost=23293.75..831082.45 rows=241 width=278)
          ->  Merge Join  (cost=23293.75..289809.83 rows=171 width=278)
                Merge Cond: ((m.old_state = w_1.state) AND (m.token = w_1.token))
                ->  Index Scan using fsa_old_state_token_idx on fsa m  (cost=0.43..245365.63 rows=4029834 width=28)
                ->  Materialize  (cost=23293.32..23793.32 rows=100000 width=278)
                      ->  Sort  (cost=23293.32..23543.32 rows=100000 width=278)
                            Sort Key: w_1.state, w_1.token
                            ->  CTE Scan on token w_1  (cost=0.00..2000.00 rows=100000 width=278)
          ->  Nested Loop  (cost=18295.78..54126.78 rows=7 width=278)
                ->  Merge Join  (cost=18295.35..19120.16 rows=4275 width=340)
                      Merge Cond: ((token.pathid = p.pathid) AND (token.start = p.end))
                      ->  Sort  (cost=18169.32..18419.32 rows=100000 width=160)
                            Sort Key: token.pathid, token.start
                            ->  CTE Scan on token  (cost=0.00..2000.00 rows=100000 width=160)
                      ->  Sort  (cost=126.03..130.30 rows=1710 width=212)
                            Sort Key: p.pathid, p.end
                            ->  WorkTable Scan on path p  (cost=0.00..34.20 rows=1710 width=212)
                ->  Index Scan using fsa_old_state_token_idx on fsa f  (cost=0.43..8.18 rows=1 width=28)
Now, when that initial token list (of start states) grows beyond this limit of 100,000, the execution plan flips:
Insert on good_paths  (cost=2041595.63..2041606.66 rows=630 width=302)
  CTE token
    ->  Limit  (cost=46728.24..115752.23 rows=200654 width=519)
          ->  Hash Left Join  (cost=46728.24..115752.23 rows=200654 width=519)
                ... this is creating the start states
  CTE path
    ->  Recursive Union  (cost=47749.96..1925801.45 rows=633 width=278)
          ->  Merge Join  (cost=47749.96..315274.30 rows=343 width=278)
                Merge Cond: ((m.old_state = w_1.state) AND (m.token = w_1.token))
                ->  Index Scan using fsa_old_state_token_idx on fsa m  (cost=0.43..245365.63 rows=4029834 width=28)
                ->  Materialize  (cost=47749.53..48752.80 rows=200654 width=278)
                      ->  Sort  (cost=47749.53..48251.16 rows=200654 width=278)
                            Sort Key: w_1.state, w_1.token
                            ->  CTE Scan on token w_1  (cost=0.00..4013.08 rows=200654 width=278)
          ->  Merge Join  (cost=158013.87..161051.45 rows=29 width=278)
                Merge Cond: ((token.token = f.token) AND (token.pathid = p.pathid) AND (token.start = p.end))
                ->  Sort  (cost=37459.53..37961.16 rows=200654 width=160)
                      Sort Key: token.token, token.pathid, token.start
                      ->  CTE Scan on token  (cost=0.00..4013.08 rows=200654 width=160)
                ->  Materialize  (cost=120554.35..120966.44 rows=82419 width=228)
                      ->  Sort  (cost=120554.35..120760.39 rows=82419 width=228)
                            Sort Key: f.token, p.pathid, p.end
                            ->  Nested Loop  (cost=0.43..104808.55 rows=82419 width=228)
                                  ->  WorkTable Scan on path p  (cost=0.00..68.60 rows=3430 width=212)
                                  ->  Index Scan using fsa_old_state_token_idx on fsa f  (cost=0.43..30.30 rows=24 width=28)
Once this merge join kicks in, the query essentially stalls. I
mean, each of the limited components runs in seconds, and I can
run them iteratively so that my initial set of tokens never grows
past 100,000; each iteration then takes roughly linear time,
proportional to the number of tokens, and the whole job completes
in about linear time. But with the merge join it doesn't complete
before several times that amount of time.
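Roughly what that manual iteration amounts to is this (a sketch
only; run_phrase_batch is a hypothetical wrapper around the
recursive INSERT above, restricted to one slice of start events):

do $$
declare
    batch_size constant int := 100000;  -- stay below the point where the plan flips
    total      bigint;
begin
    select count(*) into total from steps;
    for i in 0 .. (total / batch_size)::int loop
        -- run_phrase_batch(offset, limit) is hypothetical: it runs the
        -- recursive INSERT for one slice of at most 100,000 start events
        perform run_phrase_batch(i * batch_size, batch_size);
    end loop;
end $$;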
I doubt that I can find any trick to give the planner better data
which it could then use to figure out that the merge join is a bad
proposition.
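For what it's worth, the kind of thing I mean by better data would
be along these lines (a sketch only; I have not verified that any
of it changes the estimates here):

-- Raise the per-column statistics targets and add extended statistics
-- so the planner can see the correlation between old_state and token
-- (CREATE STATISTICS requires PostgreSQL 10 or later).
alter table fsa alter column old_state set statistics 1000;
alter table fsa alter column token     set statistics 1000;
create statistics fsa_state_token_stats (ndistinct, dependencies)
    on old_state, token from fsa;
analyze fsa;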
I wish I could just force it. I probably had this discussion here some years ago. I think that while the PostgreSQL optimizer is pretty good, there are situations such as this where its predictions do not work.
Note, for my immediate relief I have forced it by simply setting
enable_mergejoin = off. This works fine, except that it converts
both merge joins into nested loops, even though the upper merge
join was not a problem, and a nested loop is sometimes (most
often) a bad choice for bulk data. It's only in this recursive
query that it sometimes makes sense.
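To keep that override from leaking into other sessions or queries,
it can at least be scoped to the one transaction (a sketch):

begin;
set local enable_mergejoin = off;  -- reverts automatically at commit/rollback
-- run the recursive INSERT ... SELECT from above here
commit;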
regards,
-Gunther