unoptimized nested loops

Hello Everyone,
I am stumped as to what I am doing wrong. I have two tables:

metadata: parent table, 1.28m records
data: child table, 1.24m records

metadata contains descriptions of the records in data. data has two fields of concern: the id field, which is a foreign key referencing the identically named field in metadata, and the content field, which contains text ranging from a few hundred to a few thousand characters. The id fields are alphanumeric for historical reasons. Table descriptions are below, with some fields omitted for brevity:


c_db=> \d metadata
  Table "public.metadata"
      Column      |           Type           | Modifiers
------------------+--------------------------+-----------
 id               | character varying(60)    | not null
 author           | character varying(90)    |

Indexes:
    "metadata_pkey" PRIMARY KEY, btree (id)
Referenced by:
TABLE "data" CONSTRAINT "fk_metadata" FOREIGN KEY (id) REFERENCES metadata(id)


c_db=> \d data
            Table "public.data"
 Column  |          Type           | Modifiers
---------+-------------------------+-----------
 id      | character varying(30)   | not null
 content | text                    |

Indexes:
    "data_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "fk_metadata" FOREIGN KEY (id) REFERENCES metadata(id)




I regularly need to search the content field in data for short text segments. I can reduce the number of rows that need to be searched by first selecting particular kinds of entries in metadata. However, postgresql apparently refuses to do so and insists on searching the entire content column in the data table for the text segment of interest. The query that restricts the search through metadata actually takes longer than searching the entire data table straight up.

To be specific, here are the two approaches:

c_db=> select count(id) from data;
  count
---------
 1248954
(1 row)

Time: 648.358 ms


c_db=> select count(id) from data where content like '%some text%';
  count
---------
 1167810
(1 row)

Time: 180144.251 ms


c_db=> select count(id) from metadata where author like '%Kelly%';

 count
-------
  3558
(1 row)

Time: 1625.455 ms


c_db=> select count(id) from data where data.content like '%some text%' and data.id in (select id from metadata where metadata.author like '%Kelly%');

 count
-------
   152
(1 row)

Time: 211010.598 ms


c_db=> explain analyze select count(id) from data where data.content like '%some text%' and data.id in (select id from metadata where metadata.author like '%Kelly%');

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=48203.30..48203.31 rows=1 width=124) (actual time=213021.968..213021.969 rows=1 loops=1)
   ->  Nested Loop Semi Join  (cost=0.00..48202.99 rows=125 width=124) (actual time=51392.697..213021.848 rows=152 loops=1)
         ->  Seq Scan on data  (cost=0.00..47132.93 rows=125 width=135) (actual time=0.176..183040.366 rows=1167810 loops=1)
               Filter: (content ~~ '%some text%'::text)
         ->  Index Scan using metadata_pkey on metadata  (cost=0.00..8.55 rows=1 width=11) (actual time=0.025..0.025 rows=0 loops=1167810)
               Index Cond: ((metadata.id)::text = (data.id)::text)
               Filter: ((metadata.author)::text ~~ '%Kelly%'::text)
 Total runtime: 213022.028 ms
(8 rows)

Time: 213024.273 ms


Alternatively, using an inner join,


c_db=> select count(id) from data inner join metadata on data.id = metadata.id where data.content like '%some text%' and metadata.author like '%Kelly%';

 count
-------
   152
(1 row)

Time: 212211.047 ms

c_db=> explain analyze select count(id) from data inner join metadata on data.id = metadata.id where data.content like '%some text%' and metadata.author like '%Kelly%';

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=48203.30..48203.31 rows=1 width=124) (actual time=212800.026..212800.026 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..48202.99 rows=125 width=124) (actual time=22463.361..212799.911 rows=152 loops=1)
         ->  Seq Scan on data  (cost=0.00..47132.93 rows=125 width=135) (actual time=0.542..182952.708 rows=1167810 loops=1)
               Filter: (content ~~ '%some text%'::text)
         ->  Index Scan using metadata_pkey on metadata  (cost=0.00..8.55 rows=1 width=11) (actual time=0.025..0.025 rows=0 loops=1167810)
               Index Cond: ((metadata.id)::text = (data.id)::text)
               Filter: ((metadata.author)::text ~~ '%Kelly%'::text)
 Total runtime: 212800.076 ms
(8 rows)

Time: 212805.008 ms
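
Looking at both plans, the seq scan on data is estimated to return 125 rows but actually returns 1,167,810, and I assume that misestimate is why starting from data looks cheap to the planner. If I am reading this right, something like the following (which I have not run yet, so I may be off base) would show whether the LIKE condition alone is what is being misestimated:

explain select count(id) from data where content like '%some text%';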



I see no evidence that the nested loop tries to reduce overhead by starting with the smaller set; it wants to scan data first either way.
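
One rewrite I have been meaning to try, on the assumption (which I have only read about, not verified here) that a WITH clause is materialized before the rest of the query and so forces the smaller metadata set to be built first, is something along these lines:

-- sketch only: pull the small author match out into a CTE first
with kelly as (
    select id
    from metadata
    where author like '%Kelly%'
)
select count(data.id)
from data
join kelly on kelly.id = data.id
where data.content like '%some text%';

I do not know whether the planner would then probe data through data_pkey for just those ids or still fall back to a seq scan, but it would at least separate the two steps.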

I have run vacuum analyze, in the hope that the optimizer was simply working from bad statistics, to no avail. I seem to be unable to force postgresql to use the smaller set to narrow the search, even by expressing it as "in" on a subquery.
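
Another idea I have run across, though it sidesteps the planner question rather than answers it, is a trigram index so that the unanchored LIKE patterns themselves become indexable. A sketch of what I mean, assuming this server is on a release new enough (9.1 or later, as I understand it) for CREATE EXTENSION and for pg_trgm's GIN operator class to support LIKE; the index names are placeholders and I have not tried this:

-- pg_trgm is a contrib module
create extension pg_trgm;

-- GIN trigram indexes can be used for unanchored LIKE '%...%' searches
create index data_content_trgm_idx on data using gin (content gin_trgm_ops);
create index metadata_author_trgm_idx on metadata using gin (author gin_trgm_ops);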

Does anyone have any ideas about what I am doing wrong on such a fundamentally normal operation? I am sure I am missing something really obvious, but I can't even find discussions about forcing the use of smaller sets. I apologize for having to ask about something that is almost certainly trivial.

I am not subscribed to the list; please reply-all or offline.

Thank you,
tim




