It's doable. but requires a lot of work. We need support for this. Oleg On Sun, 18 Jul 2010, Howard Rogers wrote:
I asked recently about a performance problem I'd been having with some full text queries, and got really useful help that pointed me to the root issues. Currently, I'm trying to see if our document search (running on Oracle Text) can be migrated to PostgreSQL, and the reason I asked that earlier question points to a fundamental design issue we'll have with PostgreSQL that doesn't affect us in Oracle (not, I hasten to add, that that means Oracle is better/right-er/whatever. It's just different -but the difference will cause us a problem). Consider the following example (which is just one of 40-odd I could have picked). Some of our documents are in panoramic format, for example. But not many (say, 30,000 out of 10,000,000). We have a flag for 'panoramic', called 'sb12'. It's either 'y' or 'n' for any document. So a search for 'sb12n' (find me all documents which are not panoramic) is logically the same as a search for 'not sb12y'. However, 95% or more of documents will be an sb12n, because hardly any documents are panoramic in the first place. So. although the numeric outcome of 'sb12n' and 'not sb12y' will always be the same, you would have to check the entire table to find which ones are 'sb12n' (because most documents are marked that way), whereas you'd only have to check the 5% of records to find 'sb12y', because so few are marked that way. But in Oracle Text, this doesn't seem to happen: SQL> select count(*) from search_digital_rm where contains(textsearch,'bat and sb12n')>0; COUNT(*) ---------- 3040 Elapsed: 00:00:00.10 SQL> select count(*) from search_digital_rm where contains(textsearch,'bat not sb12y')>0; COUNT(*) ---------- 3040 Elapsed: 00:00:00.06 In both cases, the same number of records are returned. But, within a margin of error, the time taken to do each test is about the same. Even though the first test must be matching 'sb12n' for many millions of records, it's taking not much longer than the search for 'sb12y', which can only match about 90,000. It would seem (I can't tell from the explain plan itself) as though what's happened is that the set of 'bat' records has been fetched first (about 8000 in all). For so few records, whether you're looking for sb12y or sb12n then becomes mostly irrelevant for timing purposes, and hence the duration equivalence of both queries. This is not what happens in PostgreSQL, however (as I now know, thanks to the help on my question from a couple of days ago): ims=# select count(*) from search_rm where to_tsvector('english', textsearch) @@ to_tsquery('english','bat & sb12n'); count ------- 3849 (1 row) Time: 408.962 ms ims=# select count(*) from search_rm where to_tsvector('english', textsearch) @@ to_tsquery('english','bat & !sb12y'); count ------- 3849 (1 row) Time: 11.533 ms Now, one test takes about 40 times longer than the other, though the one taking just 11ms is as fast as Oracle can manage (impressive, considering I've done absolutely nothing to tune this PostgreSQL testbed as yet!). Logically equivalent the two tests may be, but hunting through lots of sb12n records and working out which are related to bats is apparently a lot slower than finding things the other way around, it would seem. I'm wondering firstly if there's any way I can configure PostgreSQL FTS so that it produces the sort of results we've gotten used to from Oracle, i.e., where search speeds do not go up wildly when a 'search term' is applied that happens to be used by the vast majority of document records. (For example, we currently allows searches for file types, where 80% of documents would be "word documents", another 19% would be PDFs and the remaining 1% of documents could be pretty much anything else! We can't have people searching for "definitely want only Word documents" if that means matching 8 million records and search speeds shoot to the stratosphere as a result). Secondly, I'm open to any suggestions as to how you would organise things or re-write the SQL so that the "attribute filter" is only applied to the small subset of records which match the relevant "real word" search term, if that's what's needed here. In other words, is my best bet in the earlier examples to fetch *all* "bat" records, and then nest that query within an outer query that adds a test for a separate attribute column? Or is something else called for here? Would appreciate any thoughts on the subject! Regards HJR
Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@xxxxxxxxxx, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general