Re: Improving performance of merging data between tables

Andy Colson <andy@xxxxxxxxxxxxxxx> · Mon, 29 Dec 2014 11:00:00 -0600

On 12/28/2014 3:49 PM, Pawel Veselov wrote:
Hi.

I was wondering if anybody would have any ideas on how to improve
certain operations that we are having.

<SNIP>

Besides "can somebody please look at this and let me know if I'm doing
something utterly stupid", here are my questions.

1) How do I find out what exactly is consuming the CPU in a PL/pgSQL
function? All I see is that the calls to merge_all() function take long
time, and the CPU is high while this is going on.

First, I'll admit I didn't read your entire post.

I can think of a couple methods:

1) try each of the statements in merge_all by hand with an "explain 
analyze" in front to see which is slow.  Look for things that hit big 
tables without an index.  Check that fk lookups are indexes.

2) try pg_stat_statements, setting "pg_stat_statements.track = all".  see:
http://www.postgresql.org/docs/9.4/static/pgstatstatements.html

I have used this to profile some functions, and it worked pretty well. 
Mostly I use it on a test box, but once ran it on the live, which was 
scary, but worked great.

3) try auto-explain:
http://www.postgresql.org/docs/9.4/static/auto-explain.html

I've never used it, so don't know if it'll show each statement inside a 
function.  Dumps stuff to the log AFAIK, so you'll have to dig out the 
info by hand.

> 2) Is there a better way to merge individual rows, except doing
> UPDATE/INSERT in a loop, and would that be CPU expensive?
>

Not that I know of.  I use pretty much the same thing.  Soon!  we will 
have merge/upsert support.  Hopefully it'll be fast.

> 3) Is there a better way to merge whole tables? However, note that I
> need to translate primary keys from node main table into the common main
> table, as they are used as foreign keys, hence the loops. I suspect the
> looping is CPU intensive.

Avoiding loops and doing things as sets is the best way.  If possible. 
The only loop I saw was looping over the merge_xxx tables, which is 
probably the only way.

If possible (if you haven't already) you could add and extra column to 
your secondary table that you can set as the main table's key.

bulk insert into second;
update second set magic = (select key from main where ... );

Then, maybe, you can do two ops in batch:

update main (where key exists in main)
insert into main (where key not exists in main)

-Andy

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general