Re: Help request to improve function performance

Seref Arikan <serefarikan@xxxxxxxxxxxxxxxxxxxxx> · Wed, 22 Apr 2009 22:48:16 +0100

Hi Filip, 
First of all: thanks a lot for your kind response. Here is the create script for my schema: 

CREATE TABLE "app"."archetype_data" (
  "id" BIGINT NOT NULL, 
  "context_id" VARCHAR(1000), 

  "archetype_name" VARCHAR(1000), 
  "archetype_path" VARCHAR(1000), 
  "name" VARCHAR(1000), 
  "value_string" VARCHAR(1000), 
  "value_int" BIGINT, 
  "value_double" DOUBLE PRECISION, 

  "session_id" VARCHAR(1000), 
  "instance_index" INTEGER, 
  CONSTRAINT "archetype_data_pkey" PRIMARY KEY("id")
) WITHOUT OIDS;

Now,
regarding your feedback, here are the points, hoping that you can give
me feedback, and bring me up to speed in the topic, for I've been in
the upper layers of the software world for so long :) Please forgive me
for further questions:

Are you telling me that due to name resolution process, my use of
variable and column names for context_id and session_id are causing
problems? I'll change variable names into names which would be
obviously different from column names. 

I used the temp table to speed up the following selects, since the
actual table has more than 9 million rows. after creating the temp
table, I am selecting from 50K rows. Am I wrong about the performance
gain here? What would you suggest instead?

Temp tables forced me to use execute, after hitting a known
problem, also expressed in the faq, is this what you're talking about?

I
will be investing serious time into postgresql from now on, and I hope
you can give me couple of useful hints, to ease my way forward :)
Looking at the schema, can you think of a better way to send this
result set to a java based app? 

Many thanks again

Kind regards
Seref

2009/4/22 Filip Rembiałkowski <plk.zuber@xxxxxxxxx>

2009/4/22 sarikan <serefarikan@xxxxxxxxxxxxxxxxxxxxx>

Dear members of the list,

I have a function which returns a custom type, that has only two fields,

each of them being varchar arrays.

The reason that I have written this function is that I have a table

basically with the following structure (with simplified column names)

name_col1  name_col2 sessionId

value1       value3       id1

value2       value2       id1

value4       value4       id1

value7       value4       id2

value2       value2       id2

value4       value4       id2

value1       value5       id3

Why not post your REAL schema? It would make life easier, both for you and for people trying to help.

So mutliple rows are bound together with sessionIds, and I need to get back

all rows with a query, grouped by sessionID. However, group by sql statement

does not solve my problem, since I get back a lot of rows, which I have to

group into objects again in my application. What I need is a way to return

all rows having the same sessionId as a single row. Of course this is not

possible with this table, so I've created a custom type, which has array

type columns. The following function gets all rows that belongs to a

patient, and for each session id, it inserts rows with that session id into

array fields of the custom type.

The problem is, it is very slow! Getting back all the rows with a select

takes 360 ms, while getting back the results of this function takes 50

seconds! Is there any way I can make the following function faster, or any

other methods you can recommend to do what I'm trying to do?  I am trying to

avoid hundreds of calls to db, or grouping query results in my middleware

application. Here comes the function, and your help will be much

appreciated.

Best Regards

Seref

(below code edited to be more readable; logic unchanged)

CREATE or REPLACE FUNCTION getNodeContainers( context_Id varchar) RETURNS

setof NodesContainer AS

$$

DECLARE

archetype_data_row app.archetype_data;

archetype_data_row_main app.archetype_data;

nodescontainervar NodesContainer;

session_Id varchar;

indexVar integer := 0;

BEGIN

    CREATE TEMP TABLE all_rows_of_patient AS select * from app.archetype_data

    WHERE context_id = context_Id;

    FOR session_Id IN

      SELECT distinct session_id from all_rows_of_patient

    LOOP -- do the following for each session_ID

        indexVar := 0;

        FOR archetype_data_row IN --select rows that belong to this session ID

                SELECT * from all_rows_of_patient

        WHERE session_id = session_Id and context_id = context_Id

        LOOP

                nodescontainervar.name[indexVar] := archetype_data_row.name;

                nodescontainervar.context_Id[indexVar] := archetype_data_row.context_Id;

                indexVar := indexVar + 1;

        END LOOP;

        return NEXT nodescontainervar;

    END LOOP;

    drop table all_rows_of_patient;

    return;

END;

$$ LANGUAGE 'plpgsql';

please read above code - thats what postgres actually executes. column names have precedence before variable names in name resolution.
conditions like 
 WHERE context_id = context_Id;
 WHERE session_id = session_Id and context_id = context_Id

are obviously no-op conditions, not what you really want.

I hope now it's clear now why this function has long execution time :)

some other remarks:

1) you use temp tables inside a function, which is rather bad (search archives for explanation). try to avoid it.

2) usage of indexvar is not needed - there are array operators and functions
3) if you get rid of temp table, this function could be marked as STABLE,which will prevent penalty in some strange situations
4) mark your function as STRICT, which will save some CPU cycles when someone calls it on null input

HTH.

-- 
Filip Rembiałkowski
JID,mailto:filip.rembialkowski@xxxxxxxxx
http://filip.rembialkowski.net/