Re: Highly Efficient Custom Sorting

Eliot Gable <egable+pgsql-performance@xxxxxxxxx> · Sat, 3 Jul 2010 16:17:56 -0400

Read RFC 2782 on random weighted load balancing of SRV records inside DNS. That algorithm is what I need implemented, but with an extension. I have groups of records I need to have the algorithm applied to where each group is treated separately from the others. I understand the operational complexity of what I'm doing. It is more like N^3, or more precisely G*P*W where G is the number of groups, P the number of priorities per group, and W the number of different weights per priority. But, the complexity of the algorithm means nothing in terms of performance  or run time because it will only ever deal with very small sets of records (maybe 20 rows of data, tops). Even if the algorithm were N^4, it wouldn't matter with that few records. But, more importantly, there are constraints in how the data is sub-divided. Primarily, G < P < W. Further, G and P are groupings which subdivide the entire set of data and the groups do not have overlapping data. So, maybe it's more like N^2.2 or something. But, regardless, we're only talking about 20 rows, tops. 

The issue is how efficiently the languages can deal with arrays. In Perl, I have to parse a string into an array of data, then break it up into sub arrays inside associative arrays just to work with the input. I also have to splice the array to remove elements, which I don't think is very efficient. Any way I could come up with of removing elements involved rebuilding the entire array. The same thing goes for pl/pgsql. Dealing with arrays there is also not very efficient. I do a lot of constructing of arrays from sets of data using myvar = array(select blah);. While pl/pgsql was considerably faster than Perl, it cannot come close to what I did in C++ using a hash of a hash of a linked list. The two hash tables provide my groupings and the linked list gives me something that is highly efficient for removing elements as I pick them.

I've looked through the documentation on how to re-write this in C, but I cannot seem to find good documentation on working with the input array (which is an array of a complex type). I also don't see good documentation for working with the complex type. I found stuff that talks about constructing a complex type in C and returning it. However, I'm not sure how to take an input complex type and deconstruct it into something I can work with in C. Also, the memory context management stuff is not entirely clear. Specifically, how do I go about preserving the pointers to the data that I allocate in multi-call memory context so that they still point to the data on the next call to the function for the next result row? Am I supposed to set up some global variables to do that, or am I supposed to take a different approach? If I need to use global variables, then how do I deal with concurrency?

On Sat, Jul 3, 2010 at 2:08 PM, Merlin Moncure <mmoncure@xxxxxxxxx> wrote:

On Fri, Jul 2, 2010 at 11:17 PM, Eliot Gable

<egable+pgsql-performance@xxxxxxxxx> wrote:

> Well, I re-wrote the algorithm in Perl. However, it did not solve the speed

> issue. Running time now is a whopping 240+ ms instead of the 31.8ms I was

> getting before (15.2 of which is sorting). Here is the Perl code on the

> sorting. I won't post the pl/pgsql code, because this is far more clear (in

> my opinion) on what the algorithm does:

> DROP TYPE IF EXISTS glbtype CASCADE;

> CREATE TYPE glbtype AS (

> id INTEGER,

> "group" TEXT,

> priority INTEGER,

> weight INTEGER

> );

> DROP TYPE IF EXISTS glbtype_result CASCADE;

> CREATE TYPE glbtype_result AS (

> id INTEGER,

> priority INTEGER,

> weight INTEGER,

> "order" BIGINT

> );

> CREATE OR REPLACE FUNCTION GroupedRandomWeightedLB(glbtype[]) RETURNS SETOF

> glbtype_result AS

ok, I didn't take the time to read your implementation and completely

understand it, but it looks like you're looking at a N^2 sorting at

best.

You probably want to do something like this (it might not be quite

right, you need to explain what each of your input array fields is

supposed to represent):

CREATE OR REPLACE FUNCTION GroupedRandomWeightedLB(glbtype[]) RETURNS SETOF

glbtype_result AS

$$

  with g as (select unnest(glbtype) as t)

    select array(select ((t).id, (t).priority) (t).weight), 0)::glbtype_result

      from g order by (t).group, (t).priority, random() * (t).weight);

$$ language sql;

(not sure what "order" is, is that the rownum, can't that just be

inferred from the array position?)

merlin

-- 
Eliot Gable

"We do not inherit the Earth from our ancestors: we borrow it from our children." ~David Brower 

"I decided the words were too conservative for me. We're not borrowing from our children, we're stealing from them--and it's not even considered to be a crime." ~David Brower

"Esse oportet ut vivas, non vivere ut edas." (Thou shouldst eat to live; not live to eat.) ~Marcus Tullius Cicero