Re: qsort again (was Re: Strange Create Index

Ron <rjpeace@xxxxxxxxxxxxx> · Wed, 15 Feb 2006 23:30:54 -0500

At 08:21 PM 2/15/2006, Tom Lane wrote:
Ron <rjpeace@xxxxxxxxxxxxx> writes:
> How are we choosing our pivots?

See qsort.c: it looks like median of nine equally spaced inputs (ie,
the 1/8th points of the initial input array, plus the end points),
implemented as two rounds of median-of-three choices.

OK, this is a bad way to do median-of-n partitioning for a few 
reasons.  See Sedgewick's PhD thesis for details.

Basically, if one is using median-of-n partitioning to choose a 
pivot, one should do it in =one= pass, and n for that pass should be 
<= the numbers of registers in the CPU.  Since the x86 ISA has 8 
GPR's, n should be <= 8.  7 for instance.

Special purposing the median-of-n code so that the minimal number of 
comparisons and moves is used to sort the sample and then 
"partitioning in place" is the best way to do it.  In addition, care 
must be taken to deal with the possibility that many of the keys may be equal.

The (pseudo) code looks something like this:

qs(a[],L,R){
if((R-L) > SAMPLE_SIZE){ // Not worth using qs for too few elements
   SortSample(SAMPLE_SIZE,a[],L,R);
   // Sorts SAMPLE_SIZE= n elements and does median-of-n 
partitioning for small n
   // using the minimal number of comparisons and moves.
   // In the process it ends up partitioning the first n/2 and last 
n/2 elements
   // SAMPLE_SIZE is a constant chosen to work best for a given CPU.
   //  #GPRs - 1 is a good initial guess.
   // For the x86 ISA, #GPRs - 1 = 7. For native x86-64, it's 15.
   // For most RISC CPUs it's 31 or 63.  For Itanium, it's 127 (!)
   pivot= a[(L+R)>>1]; i= L+(SAMPLE_SIZE>>1); j= R-(SAMPLE_SIZE>>1);
   for(;;){
      while(a[++i] < pivot);
      while(a[--j] > pivot);
      if(i >= j) break;
      if(a[i] > a[j]) swap(a[i],a[j]);
      }
   if((i-R) >= (j-L)){qs(a,L,i-1);}
   else{qs(a,i,R);}
else{OofN^2_Sort(a,L,R);}
// SelectSort may be better than InsertSort if KeySize in bits << 
RecordSize in bits
} // End of qs

Given that the most common CPU ISA in existence has 8 GPRs, 
SAMPLE_SIZE= 7 is probably optimal:
t= (L+R);
the set would be {L; t/8; t/4; t/2; 3*t/4; 7*t/8; R;}
==> {L; t>>3; t>>2; t>>1; (3*t)>>2; (7*t)>>3; R} as the locations.
Even better (and more easily scaled as the number of GPR's in the CPU 
changes) is to use
the set {L; L+1; L+2; t>>1; R-2; R-1; R}
This means that instead of 7 random memory accesses, we have 3; two 
of which result in a
burst access for three elements each.
That's much faster; _and_ using a sample of 9, 15, 31, 63, etc (to 
max of ~GPRs -1) elements is more easily done.

It also means that the work we do sorting the sample can be taken 
advantage of when starting
inner loop of quicksort: items L..L+2, t, and R-2..R are already 
partitioned by SortSample().

Insuring that the minimum number of comparisons and moves is done in 
SortSample can be down by using a code generator to create a 
comparison tree that identifies which permutation(s) of n we are 
dealing with and then moving them into place with the minimal number of moves.

SIDE NOTE: IIRC glibc's qsort is actually merge sort.  Merge sort 
performance is insensitive to all inputs, and there are way to 
optimize it as well.

I'll leave the actual coding to someone who knows the pg source 
better than I do.
Ron