When you really want to force a certain join type?

I have a complex query which essentially runs a finite state automaton through a WITH RECURSIVE union, adding the next state based on the previous one. This is run with 100,000 or a million start states at the same time, each step picking a new record (token) and matching it against the FSA via a three-way join:

token inner join next token inner join state-transition-table -> next state

I know this doesn't really tell you much. The following might give you a glimpse:

-- ("end" is quoted throughout because END is a reserved word; the base
-- token table is called token_raw here so the CTE name does not shadow it)
with recursive Token as (
  select * from steps left outer join token_raw using(event)
  limit 100000
), StartStates as (
  select pathId, start, "end", m.new_state as state, m.goalId
    from Token w
    inner join FSA m on (m.token = w.token and m.old_state = w.state)
), Phrase as (
  select pathId, start, "end", state, goalId
    from StartStates
  union all
  select p.pathId, p.start, n."end", n.new_state as state, n.goalId
    from Phrase p
    inner join (
      select pathId, start, "end", old_state as state, new_state, f.goalId
        from Token inner join FSA f using(token)
    ) n using(pathId, "end", state)   -- the chaining criterion
)
insert into good_paths
select * from Phrase;

There are hundreds of thousands of states. This join has a HUGE fan-out if it is not immediately limited by the chaining criterion on old_state. Any attempt to use a merge join or hash join, which computes the whole big intermediate result and only then applies the chaining criterion, just creates massive amounts of sort load and/or humongous hash tables, only to throw the vast majority of the rows away every time. But when it runs through nested loops, the indexes make it really quick.
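
For reference, the nested loop gets its speed from the composite transition index. I am reconstructing its definition here from the identifiers in the plans below; the column order is inferred from the index name and the merge sort keys, not shown verbatim anywhere:

-- transition-lookup index on the FSA table: each (old_state, token) probe
-- touches only the few matching transitions instead of sorting or hashing
-- all ~4 million fsa rows
create index fsa_old_state_token_idx on fsa (old_state, token);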

I cannot show you the exact data, but I can show you the plan that works amazingly fast:

 Insert on good_paths  (cost=912224.51..912228.71 rows=240 width=302)
   CTE token
     ->  Limit  (cost=46728.24..81127.75 rows=100000 width=519)
           ->  Hash Left Join  (cost=46728.24..115752.23 rows=200654 width=519)
               ... this is creating the start states
   CTE path
     ->  Recursive Union  (cost=23293.75..831082.45 rows=241 width=278)
           ->  Merge Join  (cost=23293.75..289809.83 rows=171 width=278)
                 Merge Cond: ((m.old_state = w_1.state) AND (m.token = w_1.token))
                 ->  Index Scan using fsa_old_state_token_idx on fsa m  (cost=0.43..245365.63 rows=4029834 width=28)
                 ->  Materialize  (cost=23293.32..23793.32 rows=100000 width=278)
                       ->  Sort  (cost=23293.32..23543.32 rows=100000 width=278)
                             Sort Key: w_1.state, w_1.token
                             ->  CTE Scan on token w_1  (cost=0.00..2000.00 rows=100000 width=278)
           ->  Nested Loop  (cost=18295.78..54126.78 rows=7 width=278)
                 ->  Merge Join  (cost=18295.35..19120.16 rows=4275 width=340)
                       Merge Cond: ((token.pathid = p.pathid) AND (token.start = p.end))
                       ->  Sort  (cost=18169.32..18419.32 rows=100000 width=160)
                             Sort Key: token.pathid, token.start
                              ->  CTE Scan on token  (cost=0.00..2000.00 rows=100000 width=160)
                       ->  Sort  (cost=126.03..130.30 rows=1710 width=212)
                             Sort Key: p.pathid, p.end
                             ->  WorkTable Scan on path p  (cost=0.00..34.20 rows=1710 width=212)
                 ->  Index Scan using fsa_old_state_token_idx on fsa f  (cost=0.43..8.18 rows=1 width=28)

Now, when that initial token list (of start states) grows beyond this limit of 100,000, the execution plan flips:

 Insert on good_paths  (cost=2041595.63..2041606.66 rows=630 width=302)
   CTE token
     ->  Limit  (cost=46728.24..115752.23 rows=200654 width=519)
           ->  Hash Left Join  (cost=46728.24..115752.23 rows=200654 width=519)
               ... this is creating the start states
   CTE path
     ->  Recursive Union  (cost=47749.96..1925801.45 rows=633 width=278)
           ->  Merge Join  (cost=47749.96..315274.30 rows=343 width=278)
                 Merge Cond: ((m.old_state = w_1.state) AND (m.token = w_1.token))
                 ->  Index Scan using fsa_old_state_token_idx on fsa m  (cost=0.43..245365.63 rows=4029834 width=28)
                 ->  Materialize  (cost=47749.53..48752.80 rows=200654 width=278)
                       ->  Sort  (cost=47749.53..48251.16 rows=200654 width=278)
                             Sort Key: w_1.state, w_1.token
                             ->  CTE Scan on token w_1  (cost=0.00..4013.08 rows=200654 width=278)
           ->  Merge Join  (cost=158013.87..161051.45 rows=29 width=278)
                 Merge Cond: ((token.token = f.token) AND (token.pathid = p.pathid) AND (token.start = p.end))
                 ->  Sort  (cost=37459.53..37961.16 rows=200654 width=160)
                       Sort Key: token.token, token.pathid, token.start
                        ->  CTE Scan on token  (cost=0.00..4013.08 rows=200654 width=160)
                 ->  Materialize  (cost=120554.35..120966.44 rows=82419 width=228)
                       ->  Sort  (cost=120554.35..120760.39 rows=82419 width=228)
                             Sort Key: f.token, p.pathid, p.end
                             ->  Nested Loop  (cost=0.43..104808.55 rows=82419 width=228)
                                   ->  WorkTable Scan on path p  (cost=0.00..68.60 rows=3430 width=212)
                                   ->  Index Scan using fsa_old_state_token_idx on fsa f  (cost=0.43..30.30 rows=24 width=28)

Once this merge join kicks in, the query essentially stalls. (I mean, each of the limited components runs in seconds; I can run them iteratively so that my initial set of tokens never grows past 100,000, and then I can complete everything in roughly linear time, each iteration taking time proportional to the number of tokens.) But with the merge join it doesn't complete before several times that amount of time.
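
To make the iterative workaround concrete, it amounts to running the same statement in slices of the seed, roughly like the sketch below. The :slice psql variable and the ORDER BY are my additions for illustration here, not part of the real query:

with recursive Token as (
  select * from steps left outer join token_raw using(event)
  order by event                        -- only to make the slices deterministic
  offset :slice * 100000 limit 100000   -- :slice = 0, 1, 2, ... per run
), ...                                  -- StartStates and Phrase exactly as above
insert into good_paths
select * from Phrase;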

I doubt that I can find any trick to give the planner better data from which it could figure out that the merge join is a bad proposition.
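
For concreteness, the kind of trick I mean would be raising the statistics targets on the join columns, something like:

-- more detailed statistics on the transition columns
-- (1000 is an arbitrary target; the default is 100)
alter table fsa alter column old_state set statistics 1000;
alter table fsa alter column token set statistics 1000;
analyze fsa;

But the estimates that flip the plan here are the sizes of the token CTE and of the recursive worktable, and as far as I can tell table statistics do not feed into either of those.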

I wish I could just force it. I probably had this discussion here some years ago. I think that while the PostgreSQL optimizer is pretty good, there are situations such as this where its predictions do not work.

Note, for my immediate relief, I have forced it simply by setting enable_mergejoin = off. This works fine, except that it converts both merge joins into nested loops; the upper merge join was not a problem, and most often a nested loop is a bad choice for bulk data. It is only in this recursive query that it sometimes makes sense.
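
At least the override can be scoped to the one statement with SET LOCAL, so the rest of the session keeps its merge joins:

begin;
set local enable_mergejoin = off;  -- reverts automatically at commit/rollback
-- ... the WITH RECURSIVE ... INSERT INTO good_paths statement from above ...
commit;

That still disables both merge joins inside that one statement, of course; it just keeps the setting from leaking into everything else.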

regards,
-Gunther

