On Sun, Oct 02, 2005 at 08:28:53AM -0700, Ben wrote: > Yes, that's the straightforward way to do it. But given that my > vectors are 256 bits in length, and that I'm going to eventually have > about 4 million of them to search through, I was hoping greater minds > than mine had figured out how to do it faster, or how compute some > kind of indexing....... somehow. Ah right, that's another kettle of fish. Well, your problem space is a form of eulidean space in the sense that the triangle inequality holds, eg dist(A,C) <= dist(A,B)+dist(B,C). Maybe one of the geometric index classes could help. Here comes something I thought up myself, but you may be able to come up with something better. 1. For each string, store its distance from the strings: 0000000000 etc 0101010101 etc 0011001100 etc maybe more or less, you'll need to investigate 2. Index each of these columns. 3. For your target string calculate the distances from each of these 4. Do a query like: SELECT key, * from table where dist_zero_string > $dist_zero_search-2 and dist_zero_string < $dist_zero_search+2 where dist_0101_string > $dist_0101_search-2 and dist_0101_string < $dist_0101_search+2 where dist_0011_string > $dist_0101_search-2 and dist_0011_string < $dist_0011_search+2 order by dist(key,searchvalue); PostgreSQL 8.1 has bitmap index scans which could make the above quite fast. Some theory may help determine the best key strings. For example all-zeros and all-ones are redundant since you can calculate the distance for one from the other. You also might need to vary your distances to match with (the 2 above) depending on the average distance to the strings and where you search. You could use functional indexes to avoid storing the actual values in the tables. Or you could take two or three of these values and make points and use an rtree index on those to find nearby points (which will represent nearby strings). Anyway, this is just some random thoughts, hope it helps, -- Martijn van Oosterhout <kleptog@xxxxxxxxx> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment:
pgpDTgiP595n9.pgp
Description: PGP signature