Martijn van Oosterhout <kleptog@xxxxxxxxx> writes:
> Given that all it's doing is counting bits, a simple fix would be to
> loop over bytes, use XOR and count ones. For extreme speedup create a
> lookup table with 256 entries to give you the answer straight away...

Yeah, I just finished doing that and got about a 20x overall speedup
(30-some seconds to build the index instead of 10 minutes).  However,
hemdistsign is *still* 70% of the runtime after doing that.  The problem
seems to be that gtsvector_picksplit is calling it an inordinate number
of times:

                0.53   30.02    1649/1649        FunctionCall2 <cycle 2> [19]
[20]    52.4    0.53   30.02    1649             gtsvector_picksplit [20]
               29.74    0.00 23519673/33035195   hemdistsign [18]
                0.14    0.00 22985469/22985469   hemdistcache [50]
                0.12    0.00  268480/10030581    makesign [25]
                0.02    0.00  270400/270400      fillcache [85]
                0.00    0.00    9894/4077032     AllocSetAlloc [34]
                0.00    0.00    9894/2787477     MemoryContextAlloc [69]

(hemdistcache calls hemdistsign --- I think gprof is doing something
funny with tail-calls here, and showing hemdistsign as directly called
from gtsvector_picksplit when control really arrives through
hemdistcache.)

The bulk of the problem is clearly in this loop, which performs O(N^2)
comparisons to find the two entries that are furthest apart in hemdist
terms:

	for (k = FirstOffsetNumber; k < maxoff; k = OffsetNumberNext(k))
	{
		for (j = OffsetNumberNext(k); j <= maxoff; j = OffsetNumberNext(j))
		{
			if (k == FirstOffsetNumber)
				fillcache(&cache[j], GETENTRY(entryvec, j));

			size_waste = hemdistcache(&(cache[j]), &(cache[k]));
			if (size_waste > waste)
			{
				waste = size_waste;
				seed_1 = k;
				seed_2 = j;
			}
		}
	}

I wonder if there is a way to improve on that.

			regards, tom lane
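
For readers following along, here is a minimal standalone sketch of the
byte-wise XOR plus 256-entry lookup-table idea discussed above.  The
identifiers (SIGLENBYTES, sig_hamming_dist, popcount_table) and the
signature length are illustrative assumptions, not the actual
contrib/tsearch2 names or values:

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	/* Illustrative signature length in bytes; not the real tsearch2 value */
	#define SIGLENBYTES 252

	/* 256-entry table mapping a byte value to its population count */
	static uint8_t popcount_table[256];

	static void
	init_popcount_table(void)
	{
		int			i;

		for (i = 0; i < 256; i++)
		{
			int			b = i;
			int			cnt = 0;

			while (b)
			{
				cnt += b & 1;
				b >>= 1;
			}
			popcount_table[i] = (uint8_t) cnt;
		}
	}

	/*
	 * Hamming distance between two fixed-length signatures: XOR the bytes
	 * and sum the per-byte popcounts from the table, rather than testing
	 * each bit individually.
	 */
	static int
	sig_hamming_dist(const uint8_t *a, const uint8_t *b)
	{
		int			dist = 0;
		int			i;

		for (i = 0; i < SIGLENBYTES; i++)
			dist += popcount_table[a[i] ^ b[i]];
		return dist;
	}

	int
	main(void)
	{
		uint8_t		s1[SIGLENBYTES];
		uint8_t		s2[SIGLENBYTES];

		init_popcount_table();

		memset(s1, 0, sizeof(s1));
		memset(s2, 0, sizeof(s2));
		s2[0] = 0x0F;			/* differ in 4 bits */
		s2[10] = 0x81;			/* differ in 2 more bits */

		printf("distance = %d\n", sig_hamming_dist(s1, s2));	/* prints 6 */
		return 0;
	}

Even with the per-byte lookup, the quoted picksplit loop still calls the
distance function O(N^2) times, which is why it dominates the profile.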