On Wed, 27 Aug 2008 10:32:43 +0200
Ivan Sergio Borgonovo <mail@xxxxxxxxxxxxxxx> wrote:

> create or replace function UpdateAuthors()
> returns void
> as
> $$
> declare
>   _row record;
>   _ItemID bigint;
>   _Authors varchar(1024);
>   _Name varchar(50);
> begin
>   _Authors := '';
>   _ItemID := null;
>   for _row in
>     select a.Name, ia.ItemID from {catalog_itemauthor} ia
>       join {catalog_author} a on a.AuthorID = ia.AuthorID
>       order by ia.ItemID
>   loop
>     if (_row.ItemID <> _ItemID) then
>       if (length(_Authors) > 2) then
>         -- drop the leading ', ' before storing
>         _Authors := substring(_Authors from 3);
>         update {catalog_items} set Authors = _Authors
>           where ItemID = _ItemID;
>       end if;
>       _Authors := '';
>     end if;
>     _ItemID := _row.ItemID;
>     _Name := trim(' \t' from _row.Name);
>     if (length(_Name) > 0) then
>       _Authors := _Authors || ', ' || _Name;
>     end if;
>   end loop;
>   -- flush the last group, otherwise the final ItemID never gets updated
>   if (length(_Authors) > 2) then
>     _Authors := substring(_Authors from 3);
>     update {catalog_items} set Authors = _Authors
>       where ItemID = _ItemID;
>   end if;
>   return;
> end;
> $$ language plpgsql volatile;

Replying to myself:

update catalog_items set Authors = s.Authors
from (
  select ia.ItemID,
    array_to_string(array_accum(trim(' \t' from a.Name)), ', ') as Authors
  from catalog_itemauthor ia
  join catalog_author a on a.AuthorID = ia.AuthorID
  where a.Name is not null
    and length(trim(' \t' from a.Name)) > 1
  group by ia.ItemID
) as s
where s.ItemID = catalog_items.ItemID;

but this looks much slower than the function:
function: 113 sec vs. single statement: 488 sec.
I repeated the test 3 times with similar results.

Can anybody explain why the aggregate underperforms so badly?
I just read that most of the time I can't be smarter than the planner,
and I thought this would be one of the circumstances where the planner
could outperform my handwritten function.

Here is the explain:

"Hash Join  (cost=137408.51..271599.78 rows=209674 width=221)"
"  Hash Cond: ("outer".itemid = "inner".itemid)"
"  ->  HashAggregate  (cost=32994.81..36664.11 rows=209674 width=58)"
"        ->  Hash Join  (cost=8544.62..31946.44 rows=209674 width=58)"
"              Hash Cond: ("outer".authorid = "inner".authorid)"
"              ->  Seq Scan on catalog_itemauthor ia  (cost=0.00..10297.21 rows=629021 width=12)"
"              ->  Hash  (cost=8309.00..8309.00 rows=94248 width=54)"
"                    ->  Seq Scan on catalog_author a  (cost=0.00..8309.00 rows=94248 width=54)"
"                          Filter: ((name IS NOT NULL) AND (length(btrim((name)::text, E' \011'::text)) > 1))"
"  ->  Hash  (cost=79538.96..79538.96 rows=833496 width=189)"
"        ->  Seq Scan on catalog_items  (cost=0.00..79538.96 rows=833496 width=189)"

Thanks to Pavel, who suggested the use of array_to_string.

Sam... I made the same error you did the first time I wrote the above
statement... missing the where clause. But I can't understand the need
for your longer version "to ensure that characters trimmed from the
authors' name are consistent".

I prefer to put a constraint on the length of the varchar as an early
warning for something unexpected, e.g. if the loop were not working as
expected I'd get an error. In fact that's exactly what happened during
development of the above function. If Authors were a text field I'd be
waiting forever, filling the DB with rubbish, etc.

There are no unlimited fields... but there are fields whose upper limit
may be hard to guess. 1024 is actually an inherited choice... but I did
some research to find a reasonable upper limit (highest number of
authors * longest author name).

thanks

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it
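
PS: in case someone wants to reproduce the single statement above, note
that array_accum is not a built-in aggregate; a definition like the
example from the PostgreSQL docs should do:

-- user-defined aggregate that collects values into an array,
-- as shown in the "User-Defined Aggregates" section of the docs
create aggregate array_accum (anyelement)
(
    sfunc = array_append,
    stype = anyarray,
    initcond = '{}'
);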