Hello,

I have encountered a strange problem when doing an anti-join with a very small table via a varchar or text field, as opposed to an integer field. The Postgres version is 9.5.3.

I did some experiments to extract the problem in a simple form. First, generate two tables with a series of numbers, once as integers and once as text. The first table has 10,000 rows, the second table just one:

=# select generate_series(1, 10000) as id, generate_series(1, 10000)::text as text into table tmp_san_1;
SELECT 10000
=# select generate_series(1, 1) as id, generate_series(1, 1)::text as text into table tmp_san_2;
SELECT 1

=# analyze tmp_san_1;
ANALYZE
=# analyze tmp_san_2;
ANALYZE

=# \d tmp_san_*
     Table "public.tmp_san_1"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |
 text   | text    |

     Table "public.tmp_san_2"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |
 text   | text    |

Now I do an anti-join between the two tables via the id field (integer). The number of resulting rows is estimated correctly as 9,999:

=# explain analyze
   select tmp_san_1.id
   from tmp_san_1
   left join tmp_san_2 on tmp_san_1.id = tmp_san_2.id
   where tmp_san_2.id is null;

                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=1.02..281.26 rows=9999 width=4) (actual time=0.019..2.743 rows=9999 loops=1)
   Hash Cond: (tmp_san_1.id = tmp_san_2.id)
   ->  Seq Scan on tmp_san_1  (cost=0.00..154.00 rows=10000 width=4) (actual time=0.007..1.023 rows=10000 loops=1)
   ->  Hash  (cost=1.01..1.01 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on tmp_san_2  (cost=0.00..1.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=1)
 Planning time: 0.138 ms
 Execution time: 3.218 ms
(8 rows)

The same anti-join using the text fields, however, estimates just 1 resulting row, while there are of course still 9,999 of them:

=# explain analyze
   select tmp_san_1.id
   from tmp_san_1
   left join tmp_san_2 on tmp_san_1.text = tmp_san_2.text
   where tmp_san_2.id is null;

                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=1.02..192.53 rows=1 width=4) (actual time=0.020..3.091 rows=9999 loops=1)
   Hash Cond: (tmp_san_1.text = tmp_san_2.text)
   Filter: (tmp_san_2.id IS NULL)
   Rows Removed by Filter: 1
   ->  Seq Scan on tmp_san_1  (cost=0.00..154.00 rows=10000 width=8) (actual time=0.008..0.983 rows=10000 loops=1)
   ->  Hash  (cost=1.01..1.01 rows=1 width=6) (actual time=0.004..0.004 rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on tmp_san_2  (cost=0.00..1.01 rows=1 width=6) (actual time=0.002..0.002 rows=1 loops=1)
 Planning time: 0.173 ms
 Execution time: 3.546 ms
(10 rows)

I cannot explain that behavior, much less think of a fix or workaround. Unfortunately my real-world example has to use varchar for the join.

Thanks for any help,
Stefan
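
One detail I noticed: in the text variant, the plan says "Hash Left Join" rather than "Hash Anti Join", and the IS NULL filter is on tmp_san_2.id while the join condition is on tmp_san_2.text. The one rewrite I still plan to try is expressing the anti-join with NOT EXISTS instead of LEFT JOIN ... IS NULL, so there is no separate filter column at all (untested so far, I have not yet checked whether it changes the estimate on my real tables):

=# explain analyze
   select tmp_san_1.id
   from tmp_san_1
   where not exists (
     select 1
     from tmp_san_2
     where tmp_san_2.text = tmp_san_1.text);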