Query Parallelizing with dblink

AI Rumman <rummandba@xxxxxxxxx> · Wed, 30 Jan 2013 10:27:07 -0500

Hi All,
I made an idea to utilize multiple core for Postgresql query. I used dblink to achieve this task.
You may find it:
http://www.rummandba.com/2013/01/postgresql-query-parallelizing-with.html

It is just a basic idea and completely usable for my data model. But I share it with you all so that you may add some more good ideas here.

------ Added the document here ------------------------------------------------------------------
I have three tables with almost 10 millions of records in each. The tables are:
customer 
account
tickets
"customer" table holds record for all kind of customers which are related to account or tickets.
We need to generate a report with of customer and its related accounts or tickets.
The basic sql is like:
select *
from
(
select c.custid,
case when a.accountid is not null then
      a.accountid
     when t.ticketid is not null then
      t.ticketid
end
as relatedid       
from customer as c
left join account as a  on c.custid = a.accountid and type ='Accounts'
left  join tickets as t on c.custid = t.ticketid and type ='HelpDesk'
where c.deleted = 0
) as q
where relatedid is not null

I have all the required indexes. But the query was taking too much time. 
One
 of the bottleneck that I always feel with Postgresql is its lack of 
query parallelism technique. Good news is that, the great developers are
 working on it.
However, I have to improve the query performance 
at this moment. So I make a plan to divide the query in two parts and 
then execute each part asynchronously and then collect the result.

To
 achieve this, I make the function qry_parallel_fn. This function create
 two separate dblink connection conn1 and conn2 and execute two 
different queries in async mode.
There is a while loop which 
checks if both the connections have completed the task or not. If yes, 
then the function return results.

CREATE OR REPLACE FUNCTION qry_parallel_fn() RETURNS SETOF RECORD AS $$
DECLARE
   v_qry1 text;
   v_qry2 text;
   cur1 cursor is
   select *
   from dblink_get_result('conn1') as t1(custid int, relatedid int);

   cur2 cursor is
   select *
   from dblink_get_result('conn2') as t1(custid int, relatedid int);

   v_closed smallint;

BEGIN

  v_qry1 := 'select custid, accountid as relatedid from customer c inner
 join account a on c.custid = a.accountid where c.deleted = 0';
     RAISE NOTICE 'vqry1 = %' , v_qry1;

  v_qry2 := 'select custid, ticketid as relatedid from customer c inner 
join tickets as t on c.custid = t.ticketid where c.deleted = 0';
   PERFORM dblink_connect('conn1','dbname=rumman');
   PERFORM dblink_connect('conn2','dbname=rumman');
     PERFORM dblink_send_query('conn1',v_qry1);
     PERFORM dblink_send_query('conn2',v_qry2);

     v_closed := 0;
     WHILE v_closed <> 2 loop 
       if check_conn_is_busy('conn1') = 0 then
          v_closed := v_closed + 1;
       end if;
       if check_conn_is_busy('conn2') = 0 then
          v_closed := v_closed + 1;
       end if;
     END LOOP;

     FOR rec IN cur1
     LOOP
       RETURN NEXT rec;
     END LOOP;

     FOR rec IN cur2
     LOOP
       RETURN NEXT rec;
     END LOOP;

     PERFORM dblink_disconnect('conn1');
     PERFORM dblink_disconnect('conn2');

     RETURN;
END;
$$
language 'plpgsql'

--Execute
--select * from test_fn() as t1(c int, d int);
-- select count(*) from test_fn() as t1(c int, d int);

CREATE OR REPLACE FUNCTION check_conn_is_busy(conn text) RETURNS INT AS $$
DECLARE
  v int;
BEGIN
   SELECT dblink_is_busy(conn) INTO v;
   RETURN v;
END;
$$
language 'plpgsql'

I
 was monitoring the server performance and found that it was using two 
cpu cores to get the result and improve the query timing a bit.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Let me know your thoughts on it.

Thanks.