On 21-4-2007 1:42 Mark Kirkwood wrote:
I don't think that will work for the vector norm i.e:
|x - y| = sqrt(sum over j ((x[j] - y[j])^2))
I don't know if this is usefull here, but I was able to rewrite that
algorithm for a set of very sparse vectors (i.e. they had very little
overlapping factors) to something like:
|x - y| = sum over j (x[j]^2) + sum over j (y[j]^2)
+ for each j where x[j] and y[j] are both non-zero: - (x[j]^2 +
y[j]^2) + (x[j] - y[j])^2
The first two parts sums can be calculated only once. So if you have
very little overlap, this is therefore much more efficient (if there is
no overlap at all you end up with x[j]^2 + y[j]^2 anyway). Besides, this
rewritten calculation allows you to store the X and Y vectors using a
trivial table-layout vector(x,i,value) which is only filled with
non-zero's and which you can trivially self-join to find the closest
matches. You don't care about the j's where there is either no x or
y-value anyway with this rewrite.
I can compare over 1000 y's of on average 100 elements to two x's of
over 1000 elements on just a single 1.8Ghz amd processor. (I use it for
a bi-kmeans algorithm, so there are only two buckets to compare to).
So it might be possible to rewrite your algorithm to be less
calculation-intensive. Obviously, with a dense-matrix this isn't going
to work, but there may be other ways to circumvent parts of the
algorithm or to cache large parts of it.
It might also help to extract only the 6 relevant columns into a
seperate temporary table which will have much smaller records and thus
can fit more records per page.
Best regards,
Arjen