Re: Hash Join over Nested Loop

Pavel Stehule <pavel.stehule@xxxxxxxxx> · Fri, 22 Nov 2019 19:54:43 +0100

pá 22. 11. 2019 v 19:42 odesílatel Luís Roberto Weck <luisroberto@xxxxxxxxxxxxxxx> napsal:

    Em 22/11/2019 14:55, Pavel Stehule
      escreveu:

          pá 22. 11. 2019 v 18:37
            odesílatel Luís Roberto Weck <luisroberto@xxxxxxxxxxxxxxx>
            napsal:

              Hey, 

                  I'm trying to figure out why Postgres is choosing a
                  Hash Join over a Nested Loop in this query:

                  SELECT T1.PesID, T1.PesNom, T1.PesValSal,
                    T1.PesCPFCNPJ, T2.CarAti, T1.CarCod, T1.EmpCod, 

                           T2.CarFan, T1.PesDatAge, T1.PesCod, COALESCE(
                    T3.PesDatAnt, DATE '00010101') AS PesDatAnt 

                    FROM ((public.Pessoa T1 

                              INNER JOIN public.Carteira T2 ON
                    T2.EmpCod = T1.EmpCod AND T2.CarCod = T1.CarCod) 

                               LEFT JOIN  (SELECT MIN(COALESCE(
                    T5.ConVenAnt, DATE '00010101')) AS PesDatAnt,
                    T4.EmpCod, T4.CarCod, T4.ConPesCod AS ConPesCod 

                                             FROM
                    (public.Contrato T4 

                                                   LEFT JOIN 
                    (SELECT MIN(ConParDatVen) AS ConVenAnt, EmpCod,
                    CarCod, ConPesCod, ConSeq 

                    FROM public.ContratoParcela T5

                    WHERE ConParAti = true 

                    AND ConParValSal > 0 

                  GROUP BY EmpCod, CarCod, ConPesCod, ConSeq )
                    T5 ON T5.EmpCod    = T4.EmpCod    AND 

                    T5.CarCod    = T4.CarCod    AND 

                    T5.ConPesCod = T4.ConPesCod AND 

                    T5.ConSeq    = T4.ConSeq) 

                                            WHERE T4.ConAti =
                    TRUE

                                            GROUP BY
                    T4.EmpCod, T4.CarCod, T4.ConPesCod ) T3 ON
                    t3.EmpCod    = T1.EmpCod AND 

                                              t3.CarCod    = T1.CarCod
                    AND 

                                              t3.ConPesCod = T1.PesCod)

                   WHERE (T2.CarAti = true)

                     AND (T1.EmpCod = 112)

                     and (UPPER(T1.PesNom) like
                    UPPER('%MARIA%')) 

                   ORDER BY T1.EmpCod, T1.CarCod, T1.PesCod

                  Here the Hash Join[1] plan takes ~700ms, and if I
                  change the first LEFT JOIN to a LEFT JOIN LATERAL,
                  forcing a nested loop, the query[2] runs in 3ms.

                  [1] https://explain.depesz.com/s/8IL3

                  [2] https://explain.depesz.com/s/f8Q9

          Maybe I am wrong, but probably you have to do more than
            just change LEFT JOIN to LATERAL JOIN. Lateral join is based
            on correlated subquery - so you had to push some predicates
            to subquery - and then the query can be much more effective.

          Regards

          Pavel

              PostgreSQL version is 11.5, I have run analyze on all the
              tables.

              PG settings:

              name                           |setting  |unit|

              -------------------------------|---------|----|

              autovacuum                     |on       |    |

              default_statistics_target      |250      |    |

              effective_cache_size           |983040   |8kB |

              effective_io_concurrency       |200      |    |

              max_parallel_workers           |6        |    |

              max_parallel_workers_per_gather|3        |    |

              random_page_cost               |1.1      |    |

              work_mem                       |51200    |kB  |

    I'm sorry, I am not sure I understood.

    This is the altered query:

    SELECT T1.PesID, T1.PesNom, T1.PesValSal, T1.PesCPFCNPJ,
      T2.CarAti, T1.CarCod, T1.EmpCod, T2.CarFan, T1.PesDatAge,
      T1.PesCod, 

           COALESCE( T3.PesDatAnt, DATE '00010101') AS
      PesDatAnt 

      FROM ((public.Pessoa T1 

                INNER JOIN public.Carteira T2 ON T2.EmpCod =
      T1.EmpCod AND T2.CarCod = T1.CarCod) 

                 LEFT JOIN LATERAL (SELECT
      MIN(COALESCE( T5.ConVenAnt, DATE '00010101')) AS PesDatAnt,
      T4.EmpCod, T4.CarCod, T4.ConPesCod AS ConPesCod 

                                      FROM (public.Contrato T4

                                            LEFT JOIN  (SELECT
      MIN(ConParDatVen) AS ConVenAnt, EmpCod, CarCod, ConPesCod, ConSeq

                                                         FROM
      public.ContratoParcela T5

                                                        WHERE
      ConParAti = true 

                                                          and
      ConParValSal > 0 

                                                        GROUP
      BY EmpCod, CarCod, ConPesCod, ConSeq ) T5 ON  T5.EmpCod =
      T4.EmpCod AND T5.CarCod = T4.CarCod AND T5.ConPesCod =
      T4.ConPesCod AND T5.ConSeq = T4.ConSeq) 

                                     WHERE T4.ConAti = TRUE

                                       AND t4.EmpCod =
        T1.EmpCod AND t4.CarCod = T1.CarCod AND t4.ConPesCod = T1.PesCod

                                     GROUP BY T4.EmpCod,
      T4.CarCod, T4.ConPesCod ) T3 ON TRUE ) --ON t3.EmpCod =
        T1.EmpCod AND t3.CarCod = T1.CarCod AND t3.ConPesCod =
        T1.PesCod) 

     WHERE (T2.CarAti = true)

       AND (T1.EmpCod = 112)

       and (UPPER(T1.PesNom) like UPPER('%MARIA%')) 

     ORDER BY T1.EmpCod, T1.CarCod, T1.PesCod

    In bold are the changes I've made to the query. I am sure
    PostgreSQL is able to push it down, since it is much faster now.
    The problem I have is that this is a query generated by an ORM,
    So I can't change it.

    I would like to understand why wasn't Postgres able to optimize
    it to a nested loop. Is there something I can do with the
    statistics?

I don't think. Postgres optimizer just doesn't support this optimization. It has sense only when you know so number of loops is very small - else nested loop should be much slower. 

Regards

Pavel

    Thanks!!