PL/pgSQL Loop Vs. Batch Update

Fellow PostgreSQLers,

This post is longish and has a bit of code, but here's my question up-front: Why are the batch queries in my PL/pgSQL functions no faster than iterating over a loop and executing a series of queries on each iteration? Are batch queries, array access, or series generation slow in PL/pgSQL?

Now, for the details of my problem. To manage an ordered many-to-many relation between blog entries and tags, I created these tables:

  CREATE TABLE entry (
    id SERIAL PRIMARY KEY,
    title text,
    content text
  );

  CREATE TABLE tag (
    id SERIAL PRIMARY KEY,
    name text
  );

  CREATE TABLE entry_coll_tag (
    entry_id integer REFERENCES entry(id)
                     ON UPDATE CASCADE
                     ON DELETE CASCADE,
    tag_id   integer REFERENCES tag(id)
                     ON UPDATE CASCADE
                     ON DELETE CASCADE,
    ord      smallint,
    PRIMARY KEY (entry_id, tag_id)
  );

To make it easy to associate an entry with a bunch of tags in a single query, I wrote this PL/pgSQL function:

  CREATE OR REPLACE FUNCTION entry_coll_tag_set (
      obj_id   integer,
      coll_ids integer[]
  ) RETURNS VOID AS $$
  DECLARE
    -- For checking to see if a tuple was updated.
    update_count smallint;
    -- For looping through the array.
    iloop integer       := 1;
  BEGIN
      -- Lock the containing object tuple to prevent inserts into the
      -- collection table.
      PERFORM true FROM entry WHERE id = obj_id FOR UPDATE;

      -- Update possible existing record with the current sequence so
      -- as to avoid unique constraint violations. We just set it to a
      -- negative number, since negative numbers are never used for
      -- ord.
      UPDATE entry_coll_tag
      SET    ord = -ord
      WHERE  entry_id = obj_id;

      -- Loop through the tag IDs to associate with the entry ID.
      WHILE coll_ids[iloop] IS NOT NULL LOOP
          -- Update possible existing collection record.
          UPDATE entry_coll_tag
          SET    ord = iloop
          WHERE  entry_id = obj_id
                 AND tag_id = coll_ids[iloop];

          -- Only insert the item if nothing was updated.
          IF NOT FOUND THEN
              -- Insert a new record.
              INSERT INTO entry_coll_tag (entry_id, tag_id, ord)
              VALUES (obj_id, coll_ids[iloop], iloop);
          END IF;
          iloop := iloop + 1;
      END LOOP;

      -- Delete any remaining tuples.
      DELETE FROM entry_coll_tag
      WHERE  entry_id = obj_id AND ord < 0;
  END;
  $$ LANGUAGE plpgsql SECURITY DEFINER;
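For context, a call looks like this (the entry and tag IDs here are just for illustration):

  -- Associate entry 42 with tags 7, 3, and 9, in that order.
  SELECT entry_coll_tag_set(42, ARRAY[7, 3, 9]);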

Josh Berkus kindly reviewed it, and suggested that I take advantage of generate_series() and batch updates instead of looping through the results. Here's the revised version:

  CREATE OR REPLACE FUNCTION entry_coll_tag_set (
      obj_id   integer,
      coll_ids integer[]
  ) RETURNS VOID AS $$
  BEGIN
      -- Lock the containing object tuple to prevent inserts into the
      -- collection table.
      PERFORM true FROM entry WHERE id = obj_id FOR UPDATE;

      -- First set all ords to negative value to prevent unique index
      -- violations.
      UPDATE entry_coll_tag
      SET    ord = -ord
      WHERE  entry_id = obj_id;

      IF NOT FOUND THEN
          -- There are no existing tuples, so just insert the new ones.
          INSERT INTO entry_coll_tag (entry_id, tag_id, ord)
          SELECT obj_id, coll_ids[gs.ser], gs.ser
          FROM   generate_series(1, array_upper(coll_ids, 1))
                 AS gs(ser);
      ELSE
          -- First, update the existing tuples to new ord values.
          UPDATE entry_coll_tag SET ord = ser
          FROM   (
              SELECT gs.ser, coll_ids[gs.ser] as move_tag
              FROM   generate_series(1, array_upper(coll_ids, 1))
                     AS gs(ser)
          ) AS expansion
          WHERE move_tag = entry_coll_tag.tag_id
                AND entry_id = obj_id;

          -- Now insert the new tuples.
          INSERT INTO entry_coll_tag (entry_id, tag_id, ord)
          SELECT obj_id, coll_ids[gs.ser], gs.ser
          FROM   generate_series(1, array_upper(coll_ids, 1))
                 AS gs(ser)
          WHERE  coll_ids[gs.ser] NOT IN (
              SELECT tag_id FROM entry_coll_tag ect2
              WHERE  entry_id = obj_id
          );

          -- Delete any remaining tuples.
          DELETE FROM entry_coll_tag
          WHERE  entry_id = obj_id AND ord < 0;
      END IF;
  END;
  $$ LANGUAGE plpgsql SECURITY DEFINER;
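In case the batch trick isn't obvious, here's roughly how generate_series() expands an array into (ord, tag_id) pairs; this is a standalone sketch with a made-up array, not part of the function:

  -- Expand a sample array into (ord, tag_id) rows.
  SELECT gs.ser AS ord, (ARRAY[7, 3, 9])[gs.ser] AS tag_id
  FROM   generate_series(1, array_upper(ARRAY[7, 3, 9], 1)) AS gs(ser);
  --  ord | tag_id
  -- -----+--------
  --    1 |      7
  --    2 |      3
  --    3 |      9

The UPDATE and INSERT in the function join against exactly this kind of expansion instead of looping over the array one element at a time.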

Josh thought that replacing the loop with a couple of batch queries would be an order of magnitude faster. So I happily ran my benchmark comparing the two approaches. The benchmark actually tests two different functions that had a similar refactoring, as well as two other functions that are quite simple. Before the tests, I inserted 100,000 entry records, a random number of tag records (1-10 per entry), and the related entry_coll_tag records, which came out to around 500,000 rows in entry_coll_tag. I ran VACUUM and ANALYZE before each test, and got the following results:

  func:  13.67 wallclock seconds (0.11 usr + 1.82 sys = 1.93 CPU) @ 21.95/s
  func2: 13.68 wallclock seconds (0.10 usr + 1.71 sys = 1.81 CPU) @ 21.93/s
  perl:  41.14 wallclock seconds (0.26 usr + 6.94 sys = 7.20 CPU) @  7.29/s

"func" is my original version of the function with the loop, "func2" is the refactored version with the batch queries, and "perl" is my attempt to do roughly the same thing in Perl. The benefit of using the functions over doing the work in app space is obvious, but there seems to be little difference between the two function approaches. If you want to see the complete benchmark SQL and test script, it's here:

http://www.justatheory.com/code/bench_plpgsql_collection_functions2.tgz

So my question is quite simple: Why was the batch query approach no faster than the loop approach? I assume that if my tests managed 100 or 1000 rows in the collection table, the batch approach would show an improvement, but why would it not when I was using it to manage 8-10 rows? Have I missed something here?
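If anyone wants to poke at a single call rather than the whole benchmark, this is how I'd time one in psql (assuming the test data from the tarball is loaded; the IDs are arbitrary):

  \timing on
  SELECT entry_coll_tag_set(1, ARRAY[2, 4, 6, 8]);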

Thanks,

David
