Please don't top post. https://en.wikipedia.org/wiki/Posting_style#Placement_of_replies https://en.wikipedia.org/wiki/Posting_style#Choosing_the_proper_posting_style >> On Tue, 18 Apr 2017 14:38:15 -0700, Rj Ewing <ewing.rj@xxxxxxxxx> >> wrote: >> >> > : >> >An idea that has come up is to use a materialized view or secondary table >> >with triggers, where we would have 3 columns (id, key, value). >> > >> >I think this would allow us to store a tsvector and gin index. Giving us >> >the ability to use fulltext search on k:v pairs, then join the original >> >data on the id field to return the entire record. >> > : On Wed, 19 Apr 2017 11:57:26 -0700, Rj Ewing <ewing.rj@xxxxxxxxx> wrote: >I did some testing using a secondary table with the key, value column. >However I don't think this will provide the performance that we need. >Queries we taking 60+ seconds just for a count. SELECT count(*) or filtered? Either way, your statistics may be way off. Did you vacuum analyze the table after the inserts (and the index creation if it was done separately)? >With 1 million rows in the primary table, this resulted in 44 million rows >in the secondary k,v table for full text searching. The same query is es >takes ~50 ms on my local machine with 1/10th the ram allocated to es then >was allocated to psql. > >I'm gonna test using trigrams indexes on approx 10 json fields, and see if >that gives us what we are looking for. > >any thought on getting sub 1 sec queries on a table with 44 million rows? > >RJ Based on your description of the data [at top], I rather doubt trigrams will be an improvement over tsvector. And they're more cumbersome to use if you don't need better similarity matching than what tsvector offers [which itself is somewhat adjustable via dictionaries]. Without more detail re: your hardware, Postgresql version, what indexes are/will be available, the types of queries you want to run, etc., it's very hard to give really meaningful suggestions. The kind of query you have alluded to is pretty easily parallelized: it can be spread over multiple sessions with result aggregation done on the client side. Or, if you you have 9.6, you might try using backend parallelism: https://www.postgresql.org/docs/9.6/static/parallel-query.html [I've not used this, but some people have done it successfully.] If you can restrict the FTS query to certain keys: SELECT id FROM mytable WHERE tsquery( ... ) @@ to_tsvector(v) AND k IN ( ... ) GROUP BY id [note: according to David Rowley, GROUP BY may be parallelized whereas DISTINCT currently cannot be.] then given an index on 'k' it may be much faster than just the FTS query alone. Subject to key variability, it also may be improved by table partitioning to reduce the search space. If the FTS query is key restricted, you can parallelize either on the client or on the server. If the FTS query is not key restricted, you pretty much are limited to server side (and 9.6 or later). And I'm out of suggestions for now. Parallel query is your best bet for maximum performance, but unless you have enough RAM to hold the entire table and its indexes, and all the query workspaces, then I doubt you will be able to get anywhere near your optimistic execution target for FTS on 40+ million rows. YMMV, George -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general