Re: scoring differences between bitmasks

Ben <bench@xxxxxxxxxxxxxxx> · Sun, 2 Oct 2005 10:17:10 -0700

Yeah, I thought about this, but damn my 3-dimensional mind, I just  
can't convince myself this will work for 256 dimensions. :) Or  
rather, I can see how it could, *given* good placements of the grid  
points to snap each vector to. Unfortunately, I don't know enough  
theory to know what good placements would be.

On Oct 2, 2005, at 9:47 AM, Martijn van Oosterhout wrote:

On Sun, Oct 02, 2005 at 08:28:53AM -0700, Ben wrote:

Yes, that's the straightforward way to do it. But given that my
vectors are 256 bits in length, and that I'm going to eventually have
about 4 million of them to search through, I was hoping greater minds
than mine had figured out how to do it faster, or how compute some
kind of indexing....... somehow.

Ah right, that's another kettle of fish. Well, your problem space is a
form of eulidean space in the sense that the triangle inequality  
holds,
eg dist(A,C) <= dist(A,B)+dist(B,C). Maybe one of the geometric index
classes could help.

Here comes something I thought up myself, but you may be able to come
up with something better.

1. For each string, store its distance from the strings:
0000000000 etc
0101010101 etc
0011001100 etc
maybe more or less, you'll need to investigate

2. Index each of these columns.
3. For your target string calculate the distances from each of these
4. Do a query like:

SELECT key, * from table
where dist_zero_string > $dist_zero_search-2 and dist_zero_string <  
$dist_zero_search+2
where dist_0101_string > $dist_0101_search-2 and dist_0101_string <  
$dist_0101_search+2
where dist_0011_string > $dist_0101_search-2 and dist_0011_string <  
$dist_0011_search+2
order by dist(key,searchvalue);

PostgreSQL 8.1 has bitmap index scans which could make the above quite
fast. Some theory may help determine the best key strings. For example
all-zeros and all-ones are redundant since you can calculate the
distance for one from the other. You also might need to vary your
distances to match with (the 2 above) depending on the average  
distance
to the strings and where you search.

You could use functional indexes to avoid storing the actual values in
the tables. Or you could take two or three of these values and make
points and use an rtree index on those to find nearby points (which
will represent nearby strings).

Anyway, this is just some random thoughts, hope it helps,
--
Martijn van Oosterhout   <kleptog@xxxxxxxxx>   http://svana.org/ 
kleptog/

Patent. n. Genius is 5% inspiration and 95% perspiration. A patent  
is a
tool for doing 5% of the work and then sitting around waiting for  
someone
else to do the other 95% so you can sue them.

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
      subscribe-nomail command to majordomo@xxxxxxxxxxxxxx so that your
      message can get through to the mailing list cleanly