On 4/20/07, Andrew Lazarus <andrew@xxxxxxxxxxxx> wrote:
I have a table with 2.5 million real[] arrays. (They are points in a time series.) Given a new array X, I'd like to find, say, the 25 closest to X in some sense--for simplification, let's just say in the usual vector norm. Speed is critical here, and everything I have tried has been too slow.
Let me chime in with the observation that this is a multidimensional nearest neighbour (reverse nearest neighbour and its close cousin, k-NN) that is well known in statistics, and particularly relevant to statistical learning and classification. Knowing the jargon might help you dig up efficient algorithms to mine your data; there are tons of fascinating papers available through Citeseer. In particular, I recommend the paper "Efficient k-NN Search on Vertically Decomposed Data" by de Vries et al, SIGMOD 2002 (PDF here: http://citeseer.ist.psu.edu/618138.html), if only for inspiration. It proposes an algorithm called BOND to drastically reduce the search space by probalistic means. They give an example using image histograms, but the algorithm applies to any multidimensional data. Briefly put, it points out that proximity comparison can be computed vertically, a few dimensions at a time, and entire subsets can be thrown away when it's apparent that they are below a statistically derived lower bound. The only gotcha is that the algorithm derives much of its performance from the assumption that your data is vertically decomposed, one table per dimension, otherwise the search effectively incurs a sequential scan of the entire dataset, and then you're pretty much back to square one. The most common approach to nearest neighbour search is to use a spatial data structure. The classic algorithm is the kd-tree (http://en.wikipedia.org/wiki/Kd-tree) and there's the newer K-D-B tree, neither of which are available in PostgreSQL. If I remember correctly, R-trees have also been shown to be useful for high numbers of dimensions; with PostgreSQL you have R-trees and even better R-tree-equivalent support through GiST. I have no idea whether you can actually munge your integer vectors into something GiST can index and search, but it's a thought. (GiST, presumably, can also theoretically index kd-trees and other spatial trees.) Alexander.