Hello Everyone,
I am stumped as to what I am doing wrong. I have two tables
metadata: parent table, 1.28m records
data: child table, 1.24m records
metadata contains descriptions of the records in data. data has two
fields of concern, the id field, which is a foreign key to an identical
field in metadata, and the content field, which contains text ranging
from a few hundred to a few thousand characters. The id fields are
alphanumeric for historical reasons. Table descriptions below, some
fields omitted for brevity:
c_db=> \d metadata
Table "public.metadata"
Column | Type | Modifiers
------------------+--------------------------+-----------
id | character varying(60) | not null
author | character varying(90) |
Indexes:
"metadata_pkey" PRIMARY KEY, btree (id)
Referenced by:
TABLE "data" CONSTRAINT "fk_metadata" FOREIGN KEY (id) REFERENCES
metadata(id)
c_db=> \d data
Table "public.data"
Column | Type | Modifiers
---------+-------------------------+-----------
id | character varying(30) | not null
content | text |
Indexes:
"data_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
"fk_metadata" FOREIGN KEY (id) REFERENCES metadata(id)
I am regularly needing to search the content field in data for short
text segments. I can reduce the number of rows needing to be searched
by looking for particular types of entries in metadata. However,
postgresql is apparently refusing to do so, and insists on searching the
entire content column in the data table for the text segment of
interest. It takes more time to search the effort to reduce than the
entire data table straight up.
To be specific with two approaches:
c_db=> select count(id) from data;
count
---------
1248954
(1 row)
Time: 648.358 ms
ic_db=> select count(id) from data where content like '%some text%';
count
---------
1167810
(1 row)
Time: 180144.251 ms
c_db=>select count(id) from metadata where author like '%Kelly%';
count
-------
3558
(1 row)
Time: 1625.455 ms
c_db=>select count(id) from data where data.content like '%some text%'
and data.id in (select id from metadata where metadata.author like
'%Kelly%');
count
-------
152
(1 row)
Time: 211010.598 ms
c_db=> explain analyze select count(id) from data where data.content
like '%some text%' and data.id in (select id from metadata where
metadata.author like '%Kelly%');
QUERY
PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=48203.30..48203.31 rows=1 width=124) (actual
time=213021.968..213021.969 rows=1 loops=1)
-> Nested Loop Semi Join (cost=0.00..48202.99 rows=125 width=124)
(actual time=51392.697..213021.848 rows=152 loops=1)
-> Seq Scan on data (cost=0.00..47132.93 rows=125 width=135)
(actual time=0.176..183040.366 rows=1167810 loops=1)
Filter: (data ~~ '%some text%'::text)
-> Index Scan using metadata_pkey on metadata
(cost=0.00..8.55 rows=1 width=11) (actual time=0.025..0.025 rows=0
loops=1167810)
Index Cond: ((metadata.id)::text = (data.id)::text)
Filter: ((metadata.author)::text ~~ '%Kelly%'::text)
Total runtime: 213022.028 ms
(8 rows)
Time: 213024.273 ms
Alternatively, using an inner join,
c_db=> select count(id) from data inner join metadata on data.id =
metadata.id where data.content like '%some text%' and metadata.author
like '%Kelly%';
count
-------
152
(1 row)
Time: 212211.047 ms
c_db=> explain analyze select count(id) from data inner join metadata on
data.id = metadata.id where data.context like '%some text%' and
metadata.author like '%Kelly%';
QUERY
PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=48203.30..48203.31 rows=1 width=124) (actual
time=212800.026..212800.026 rows=1 loops=1)
-> Nested Loop (cost=0.00..48202.99 rows=125 width=124) (actual
time=22463.361..212799.911 rows=152 loops=1)
-> Seq Scan on data (cost=0.00..47132.93 rows=125 width=135)
(actual time=0.542..182952.708 rows=1167810 loops=1)
Filter: (data ~~ '%some text%'::text)
-> Index Scan using metadata_pkey on metadata
(cost=0.00..8.55 rows=1 width=11) (actual time=0.025..0.025 rows=0
loops=1167810)
Index Cond: ((metadata.id)::text = (data.id)::text)
Filter: ((metadata.author)::text ~~ '%Kelly%'::text)
Total runtime: 212800.076 ms
(8 rows)
Time: 212805.008 ms
I do not see evidence that the nested loop is trying to reduce overhead
by using the smaller set. It seems to want to scan on data first either
way.
I have run vacuum analyze, in the hopes that the optimizer is
miscalculating, to no avail. I seem to be unable to force postgresql to
use the smaller set to reduce the search, even with the use of "in" on a
subquery.
Does anyone have any ideas about what I am doing wrong on such a
fundamentally normal operation? I am sure I am missing something really
obvious, but I can't even find discussions about forcing the use of
smaller sets. I apologize for having to ask about something that is
almost certainly trivial.
I am not subscribed to the list; please reply-all or offline.
Thank you,
tim