At 06:35 AM 2/16/2006, Steinar H. Gunderson wrote:
On Wed, Feb 15, 2006 at 11:30:54PM -0500, Ron wrote:
> Even better (and more easily scaled as the number of GPR's in the CPU
> changes) is to use
> the set {L; L+1; L+2; t>>1; R-2; R-1; R}
> This means that instead of 7 random memory accesses, we have 3; two
> of which result in a burst access for three elements each.
Isn't that improvement going to disappear completely if you choose a bad
pivot?
Quicksort will degenerate to the O(N^2) behavior everyone worries about
only if you _consistently_ (read: "the vast majority of the time";
quicksort is actually darn robust) choose a _pessimal_, not just "bad",
pivot. See Cormen, Leiserson & Rivest for a proof of this.
Even then, doing things as above has benefits:
1= The worst case is less bad, since the guaranteed O(lg(s!))
pivot-choosing algorithm puts s elements into their final positions.
The worst case becomes better than O(N^2/(s-1)).
2= With more traditional methods, the overhead of pivot choosing can
overshadow the benefits for even moderate values of s. See the
discussions of the quicksort variant known as "samplesort" and
Sedgewick's PhD thesis for details. Using a pivot-choosing algorithm
that actually does some of the partitioning (and does it more
efficiently than the "usual" partitioning algorithm does), plus using
partition-in-place (rather than Lomuto's method), reduces overhead
very effectively, at the "cost" of partitioning code that is more
complicated and delicate to get right (see the first sketch below).
The above considerably reduces the number of moves used in a
quicksort pass, regardless of the number of compares used.
3= Especially on modern systems, where the gap between internal CPU
bandwidth and memory bandwidth is so great, the overhead of memory
accesses for comparisons and moves, particularly random memory
accesses, is the majority of the overhead for both the pivot-choosing
and the partitioning algorithms within quicksort. The reason
(#GPRs - 1) is a magic constant is that it's the most elements you can
compare and move using only register-to-register operations.
In addition, making as many as possible of the memory accesses you
must do sequential rather than random is a big deal: sequential
memory access is measured in tens of CPU cycles, while random memory
access is measured in hundreds of CPU cycles. It's no accident that
the advances in Gray's sorting contest have involved algorithms that
are both register- and cache-friendly, minimizing overall memory
access and using sequential memory access as much as possible when
said access cannot be avoided. As caches grow larger and memory
accesses grow more expensive, it's often worth it to use a
BucketSort+QuickSort hybrid rather than just QuickSort (the second
sketch below).
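
To make point 2 concrete, here is a minimal sketch of the idea: a
quicksort that draws its pivot from the sample set {L, L+1, L+2, mid,
R-2, R-1, R} and partitions in place Hoare-style instead of with
Lomuto's method. This is my own illustration, not code from glibc or
anywhere else: it assumes plain int keys, sorts the 7-element sample
in locals so that work stays register-resident, and it does *not*
attempt the further trick of having the pivot-choosing pass leave the
sample elements in their final positions. The names (quicksort7,
choose_pivot7, partition_hoare) are made up for the sketch.

#include <stddef.h>

static void swap_int(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* Straight insertion sort for small ranges a[lo..hi]. */
static void insertion_sort(int *a, size_t lo, size_t hi)
{
    for (size_t i = lo + 1; i <= hi; i++) {
        int key = a[i];
        size_t j = i;
        while (j > lo && a[j - 1] > key) { a[j] = a[j - 1]; j--; }
        a[j] = key;
    }
}

/* Median of the 7-element sample {L, L+1, L+2, mid, R-2, R-1, R}:
   two 3-element bursts plus one lone read, and the sample is sorted
   entirely in locals/registers. */
static int choose_pivot7(const int *a, size_t L, size_t R)
{
    size_t mid = L + (R - L) / 2;
    int v[7] = { a[L], a[L + 1], a[L + 2], a[mid], a[R - 2], a[R - 1], a[R] };
    for (int i = 1; i < 7; i++) {       /* tiny insertion sort of the sample */
        int key = v[i];
        int j = i - 1;
        while (j >= 0 && v[j] > key) { v[j + 1] = v[j]; j--; }
        v[j + 1] = key;
    }
    return v[3];                        /* the median of the seven */
}

/* Hoare-style partition-in-place: scan inward from both ends and swap,
   rather than Lomuto's single left-to-right scan. Returns p such that
   a[L..p] <= pivot <= a[p+1..R]. */
static size_t partition_hoare(int *a, size_t L, size_t R, int pivot)
{
    size_t i = L, j = R;
    for (;;) {
        while (a[i] < pivot) i++;
        while (a[j] > pivot) j--;
        if (i >= j)
            return j;
        swap_int(&a[i], &a[j]);
        i++; j--;
    }
}

/* Sort a[L..R] inclusive; caller guarantees L <= R. */
void quicksort7(int *a, size_t L, size_t R)
{
    if (R - L < 7) {                    /* small range: insertion sort */
        insertion_sort(a, L, R);
        return;
    }
    int pivot = choose_pivot7(a, L, R);
    size_t p = partition_hoare(a, L, R, pivot);
    quicksort7(a, L, p);
    quicksort7(a, p + 1, R);
}

Call it as quicksort7(a, 0, n - 1) for n > 0. The sample is read as two
three-element bursts (a[L..L+2] and a[R-2..R]) plus one lone access
(a[mid]), which is the "three accesses instead of seven" from the
quoted proposal.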
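
And here is the kind of BucketSort+QuickSort hybrid I mean, again a
rough sketch of my own rather than anyone's production code. It
assumes 32-bit unsigned keys whose high bits are reasonably well
spread: one sequential pass distributes the input into 256 buckets by
the top byte, and each bucket, which should by then be small enough to
stay cache-resident while it's being worked on, is finished with an
ordinary quicksort (the stock qsort(3) stands in for that here; the
name bucket_then_quicksort is made up).

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int cmp_u32(const void *pa, const void *pb)
{
    uint32_t a = *(const uint32_t *)pa, b = *(const uint32_t *)pb;
    return (a > b) - (a < b);
}

/* Distribute into 256 buckets on the top 8 bits (one sequential pass),
   quicksort each bucket, and the concatenation is already sorted. */
int bucket_then_quicksort(uint32_t *a, size_t n)
{
    size_t count[256] = {0}, start[256], next[256];
    uint32_t *tmp;

    if (n < 2)
        return 0;
    tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    for (size_t i = 0; i < n; i++)              /* histogram pass */
        count[a[i] >> 24]++;

    for (size_t b = 0, pos = 0; b < 256; b++) { /* bucket start offsets */
        start[b] = next[b] = pos;
        pos += count[b];
    }

    for (size_t i = 0; i < n; i++)              /* scatter into buckets */
        tmp[next[a[i] >> 24]++] = a[i];

    for (size_t b = 0; b < 256; b++)            /* sort each bucket */
        if (count[b] > 1)
            qsort(tmp + start[b], count[b], sizeof *tmp, cmp_u32);

    memcpy(a, tmp, n * sizeof *a);
    free(tmp);
    return 0;
}

Because the buckets are formed on the most significant bits, no merge
step is needed afterwards. If the key distribution is badly skewed, the
buckets degenerate and you are back to plain quicksort performance,
which is why this is only worth doing when you know something about
the data.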
...and of course, if you know enough about the data to be sorted to
constrain it appropriately, you should use a non-comparison-based
O(N) sorting algorithm rather than any of the general
comparison-based O(NlgN) methods.
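
For instance, if the keys are known to be 32-bit unsigned integers, an
LSD radix sort does the whole job in four counting passes and never
compares two keys at all. A minimal sketch of my own (the name
radix_sort_u32 is made up):

#include <stdint.h>
#include <stdlib.h>

/* LSD radix sort on 32-bit unsigned keys: four stable counting passes
   of 8 bits each, O(N) total work, no comparisons. */
int radix_sort_u32(uint32_t *a, size_t n)
{
    uint32_t *tmp, *src, *dst;

    if (n < 2)
        return 0;
    tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    src = a;
    dst = tmp;
    for (unsigned shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};

        for (size_t i = 0; i < n; i++)          /* histogram this byte */
            count[(src[i] >> shift) & 0xFF]++;

        size_t pos = 0;                         /* exclusive prefix sums */
        for (size_t b = 0; b < 256; b++) {
            size_t c = count[b];
            count[b] = pos;
            pos += c;
        }

        for (size_t i = 0; i < n; i++)          /* stable scatter */
            dst[count[(src[i] >> shift) & 0xFF]++] = src[i];

        uint32_t *t = src; src = dst; dst = t;  /* ping-pong the buffers */
    }

    /* after an even number of passes the sorted data is back in a[] */
    free(tmp);
    return 0;
}

Every pass streams the input and output sequentially, which is exactly
the kind of memory access pattern argued for above.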
> SIDE NOTE: IIRC glibc's qsort is actually merge sort. Merge sort
> performance is insensitive to all inputs, and there are ways to
> optimize it as well.
glibc-2.3.5/stdlib/qsort.c:
/* Order size using quicksort. This implementation incorporates
four optimizations discussed in Sedgewick:
I can't see any references to merge sort in there at all.
Well, then I'm not the only person on the lists whose memory is faulty ;-)
The upside of MergeSort is that its performance is always O(NlgN).
The downsides are that it is far more memory-hungry than QuickSort, and slower.
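
For illustration, here is a bare-bones top-down MergeSort over ints (my
own sketch, not glibc's code); the N-element scratch buffer it has to
allocate, and the copy traffic through it, is where the extra memory
and time go:

#include <stdlib.h>
#include <string.h>

/* Merge the sorted halves a[lo..mid) and a[mid..hi) via the scratch buffer. */
static void merge(int *a, int *scratch, size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        scratch[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
    while (i < mid) scratch[k++] = a[i++];
    while (j < hi)  scratch[k++] = a[j++];
    memcpy(a + lo, scratch + lo, (hi - lo) * sizeof *a);
}

static void msort_rec(int *a, int *scratch, size_t lo, size_t hi)
{
    if (hi - lo < 2)
        return;
    size_t mid = lo + (hi - lo) / 2;
    msort_rec(a, scratch, lo, mid);
    msort_rec(a, scratch, mid, hi);
    merge(a, scratch, lo, mid, hi);
}

/* Always O(NlgN), but needs an extra N-element buffer that an in-place
   quicksort does not. */
int mergesort_int(int *a, size_t n)
{
    int *scratch;

    if (n < 2)
        return 0;
    scratch = malloc(n * sizeof *scratch);
    if (scratch == NULL)
        return -1;
    msort_rec(a, scratch, 0, n);
    free(scratch);
    return 0;
}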
Ron