Re: Parallel Scaling of a PL/pgSQL problem

Another question (probably a silly mistake) while debugging this problem:
I put INSERT statements into the PL/pgSQL code to collect current_timestamp and see where the code was spending most of its time. The output is as follows:

                                                                 log_text                                                                 |        insertdatetime
------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------
 INPUT VARIABLES,Src,emp_id_1,emp_id_100,StartDate,2012-01-01,EndDate,2012-02-01,Tou,DOnPk,CRR Type,ptp-obligations,MinPositivePirce,1.00 | 2012-04-25 20:56:42.691965+00
 Starting get_DAM_Value function                                                                                                          | 2012-04-25 20:56:42.691965+00
 StartDate,2012-01-01 01:00:00,EndDate,2012-02-02 00:00:00                                                                                | 2012-04-25 20:56:42.691965+00
 StartHr,vSrcprice,vSinkprice,vDiff,vAvgValue,vTotalDAMValue,vPkTotal,vOffPkTotal                                                         | 2012-04-25 20:56:42.691965+00
 2012-01-01 01:00:00,18.52,16.15,-2.370000,-2.370000,0.000000,0.000000,-2.370000                                                          | 2012-04-25 20:56:42.691965+00
 2012-01-01 02:00:00,17.22,15.60,-1.620000,-1.620000,0.000000,0.000000,-3.990000                                                          | 2012-04-25 20:56:42.691965+00
 2012-01-01 03:00:00,18.22,17.55,-0.670000,-0.670000,0.000000,0.000000,-4.660000                                                          | 2012-04-25 20:56:42.691965+00
 2012-01-01 04:00:00,18.53,18.13,-0.400000,-0.400000,0.000000,0.000000,-5.060000                                                          | 2012-04-25 20:56:42.691965+00
 ...
 ...
 ...
 ...
 2012-02-02 00:00:00,2,17.13,17.13,0.000000,0.000000,0.000000,-7940.250000,-5216.290000                                                   | 2012-04-25 20:56:42.691965+00
 2012-02-02 00:00:00,3,16.54,16.54,0.000000,0.000000,0.000000,-7940.250000,-5216.290000                                                   | 2012-04-25 20:56:42.691965+00
 2012-02-02 00:00:00,4,16.27,16.27,0.000000,0.000000,0.000000,-7940.250000,-5216.290000                                                   | 2012-04-25 20:56:42.691965+00
 TotalPEAKMVal,-7940.250000,NumIntervals,2034,AvgRTMVAlue,-3.903761                                                                       | 2012-04-25 20:56:42.691965+00
 OUTPUT VARIABLES,AvgPurValue,-2.84,AvgSaleValue,-3.90,Profit,-1.06,ProfitPercentage,-106.00                                              | 2012-04-25 20:56:42.691965+00
(3832 rows)

Why doesn't the current_timestamp value change within the PL/pgSQL code? After every SELECT the computation runs, I insert a row into my debug_log dummy table. All 3832 rows carry the same current_timestamp value, which I write with the following insert statement:
insert into debug_log values ('some log text', current_timestamp);
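
For reference, debug_log is just a two-column dummy table along these lines (a sketch; the column names are taken from the query output above, the types are my assumption):

    create table debug_log (
        log_text        text,          -- the message being logged
        insertdatetime  timestamptz    -- filled with current_timestamp at insert
    );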

But when I run this procedure from the psql command line, it takes 11 seconds. (I used a bigger date range than the one behind the 40 ms figure in my first post, hence the jump to 11 seconds.)

dev=> select calc_comp('emp_id_1','emp_id_100','DAM-RTM',to_date('2012-01-01','yyyy-mm-dd'),to_date('2012-02-01','yyyy-mm-dd'),'DOnPk','ptp-obligations',1.00, 0);
         crr_valuation         
-------------------------------
 0||-2.84|-3.90|-1.06|-106.00|
(1 row)

Time: 11685.828 ms

The last input value (1/0) in the call above controls whether the log messages are inserted: 0 = insert, 1 = do not insert; the guard is sketched below. When I toggle the flag, the overall timing does not change. Shouldn't writing 3832 rows to a table take at least some milliseconds?
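
The guard around each logging statement is nothing more than this (a sketch; the parameter and variable names pDebug and vLogText are illustrative, not the real ones):

    -- 0 = insert log rows, 1 = do not insert
    if pDebug = 0 then
        insert into debug_log values (vLogText, current_timestamp);
    end if;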
Why is my current_timestamp value not changing? I was expecting the last row's timestamp minus the first row's timestamp to equal my 11.685 seconds. What is wrong here?
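
For anyone trying to reproduce this, a minimal psql sketch that isolates the timestamp behavior in question (it assumes nothing beyond an open psql session):

    begin;
    select current_timestamp, clock_timestamp();  -- both read near transaction start
    select pg_sleep(2);                           -- simulate 2 seconds of work
    -- current_timestamp still shows the transaction start time;
    -- clock_timestamp() has advanced by roughly 2 seconds
    select current_timestamp, clock_timestamp();
    commit;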


-Venki


From: Samuel Gendler <sgendler@xxxxxxxxxxxxxxxx>
To: Venki Ramachandran <venki_ramachandran@xxxxxxxxx>
Cc: "pgsql-performance@xxxxxxxxxxxxxx" <pgsql-performance@xxxxxxxxxxxxxx>
Sent: Wednesday, April 25, 2012 12:36 PM
Subject: Re: Parallel Scaling of a PL/pgSQL problem



On Wed, Apr 25, 2012 at 11:52 AM, Venki Ramachandran <venki_ramachandran@xxxxxxxxx> wrote:
Hi all:
Can someone please guide me on how to solve this problem? If this is the wrong forum, please let me know which one to post it in. I am new to Postgres (about 3 months into it).

I have a PostgreSQL 9.0 database on an AWS server (x-large) and a PL/pgSQL program that does some computation. It takes a date range and, for one pair of personnel (two employees in a company), calculates some values over that period. It takes about 40 ms (milliseconds) to complete and give me the answer. All good so far.

Now I have to run the same PL/pgSQL program on all possible combinations of employees, and with 542 employees that is about 300,000 unique ordered pairs (542 × 541 = 293,222).

So (300000 × 40) / (1000 × 60 × 60) = 3.33 hours, and I have to rank the results and show them on a screen. No user wants to wait 3 hours; they can probably wait 10 minutes (even that is too long for a UI application). How do I solve this scaling problem? Can I have multiple parallel sessions, each with multiple processes that handle one pair at 40 ms apiece, and then collate the results? Does PostgreSQL or PL/pgSQL have any parallel computing capability?

The question is: how much of that 40 ms is spent performing the calculation, how much is spent querying, how much is function call overhead, and how much is the round trip between the client and server with the query and results?  Depending upon the breakdown, it is entirely possible that the actual per-record multiplier can be kept down to a couple of milliseconds if you restructure things to query data in bulk and call only a single function to do the work.  If you get it down to 4 ms, that's a 20-minute query.  Get it down to 1 ms and you're looking at only 5 minutes for what would appear to be a fairly compute-intensive report over a relatively large dataset.
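
A set-based version of that idea might look roughly like this (a sketch, not the thread's actual schema: the employees table and its emp_id column are assumptions, while the calc_comp signature and arguments are taken from the call shown earlier, with the last argument set to 1 to skip debug logging):

    -- one server-side query instead of ~300,000 separate client calls (sketch)
    select e1.emp_id as src,
           e2.emp_id as sink,
           calc_comp(e1.emp_id, e2.emp_id, 'DAM-RTM',
                     to_date('2012-01-01','yyyy-mm-dd'),
                     to_date('2012-02-01','yyyy-mm-dd'),
                     'DOnPk', 'ptp-obligations', 1.00, 1) as crr_valuation
    from employees e1
    join employees e2 on e1.emp_id <> e2.emp_id;  -- every ordered (src, sink) pair

This removes the per-pair client/server round trip and per-statement overhead, but the work inside calc_comp itself is unchanged, so the per-pair cost inside the function would still have to come down to reach a 10-minute target.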




