Re: full text search on hstore or json with materialized view?

George Neuner <gneuner2@xxxxxxxxxxx> · Fri, 21 Apr 2017 09:03:02 -0400

On Thu, 20 Apr 2017 08:50:31 -0700, Rj Ewing <ewing.rj@xxxxxxxxx>
wrote:

>On Wed, Apr 19, 2017 at 9:55 PM, George Neuner <gneuner2@xxxxxxxxxxx> wrote:
>
>> ... Since you are *testing* with 1M records (that
>> create 44M k:v shards), I am assuming you will need to deal with much
>> more than that in deployment.  And if you think you need FTS, then you
>> must be expecting more than simple word matches [as below], else you
>> might do something simpler like
>>
>>   SELECT ...
>>     WHERE val ILIKE <value>
>
>the 1M records would most likely be the max. On average the tables would
>have more like 100,000 records each.

Ok, so my assumption was way off ... you should be able to achieve the
timing you want with appropriate indexing. 

>from my understanding, *ILIKE* doesn't do any text normalization, which is
>something we would like to have.

Right. If you want rooting/stemming or dictionary translation, then
you do need to use FTS.

>> >how would I write an AND query that filtered on 2 separate keys from the
>> >samples_lg_txt table?
>> >
>> >something like:
>> >
>> > SELECT COUNT(*) FROM samples WHERE id IN ( SELECT DISTINCT(s.id) FROM
>> > samples_lg_txt s JOIN keys k ON s.key = k.id WHERE (*name = 'key1' AND
>> > tsv @@ to_tsquery('value1')) AND (name = 'key2' AND tsv @@
>> > to_tsquery('value2'))*;
>>
>> You're overthinking it
>>
>>   SELECT count(distinct s.id)
>>     FROM  samples_lg_txt AS s
>>     JOIN  keys AS k ON k.id = s.key
>>     WHERE (k.name = 'key1' AND s.tsv @@ to_query('value1')
>>        OR (k.name = 'key2' AND s.tsv @@ to_query('value2')
>
>but that is an OR query, I'm trying to do an AND query.

Sorry, I missed the AND in your original query.  Still the nested
SELECT is unnecessary.

Postgresql's planner/optimizer is pretty smart, and probably would
collapse your code into mine (modulo the AND/OR goof), but I prefer
not to rely on the planner to be smart ... that gets you into trouble
when you have to switch between DBMS.

>> There's actually no need to join if you can use the key name instead
>> of an integer id.  You can FK on strings, so you can still maintain an
>> identity table of keys.  E.g.,
>>
>> > id     | integer   |
>> > key    | vchar(32) | FK key(name) ...
>> > val    | text      |
>> > tsv    | tsvector  |
>>
>>
>> Then the query could be just
>>
>>   SELECT count(distinct id)
>>     FROM  samples_lg_txt
>>     WHERE (key = 'key1' AND tsv @@ to_query('value1')
>>        OR (key = 'key2' AND tsv @@ to_query('value2')
>>
>
>?this would make queries simpler?. 

Yes - it eliminates the joins, and the query runs on a single table.

>I guess a disadvantage to using a string
>for the key is that the db size would be larger, and thus not as likely to
>fit the entire table in ram. If there are only 63 keys across 44M rows, it
>seems that storing an smallint would take less space then storing the
>string.

Maybe.  Using the integer FK reduces the table size, but it requires a
join with the foreign table.  A join of two tables requires indexes
for the join columns on both tables [which may or may not already
exist], and produces [variously] a temporary hash or key relation
table that represents the rows of the "joined" table.  These temporary
structures can grow very large and may have to spill onto disk.

You can somewhat control that with the work_mem setting.  But remember
that the setting applies to every operation of every concurrent query
... so setting work_mem very high can backfire.

So saving one place can cost you in another.  TANSTAAFL.

>I don't really have a need for the identity table of keys. It's only
>purpose was to shrink the database size.
>
>Thanks again for the detailed responses!

George

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general