(Kindly forgive my top-post. My stupid email client software (Microsoft Outlook Express on Win XP) refuses to respect its own "Reply" option settings for inline reply text. I've been looking for a replacement email client, but so far without success.)
* * *
> You don't ever want to delete from such a table so you need to set up
> something which allows you to truncate the tables when you no longer
> need them.
I am migrating a PHP web application (called ITS-ETO) from MySQL to Postgres. The app INSERTs a row into a Postgres table <eto_sql_tb_session_www> to manage each session with a web browser. Here is the DDL for the session table:
CREATE TABLE its_eto.eto_sql_tb_session_www (
    session_www_code        char(32)      NOT NULL UNIQUE PRIMARY KEY,
    session_www_type        int           NOT NULL,
    session_www_state       int           NOT NULL,
    session_verify_code     char(7)       NOT NULL,
    session_www_serno       SERIAL        NOT NULL UNIQUE,
    session_target_serno    int           NULL,
    session_target_data     varchar(1000) NULL,
    session_www_init_utc    timestamp     NOT NULL,
    session_www_last_utc    timestamp     NOT NULL,
    session_www_expiry_utc  timestamp     NOT NULL,
    session_www_delete_utc  timestamp     NOT NULL,
    session_www_hit_count   int           NOT NULL,
    session_www_act_seqno   int           NULL
);

CREATE INDEX ON its_eto.eto_sql_tb_session_www (session_www_serno);
Using a "fuzzy" probability mechanism, some
randomly-selected fraction of the HTTP requests that initiate a
new session, also SELECT and DELETE expired rows from the session
table. I naively assumed that the database server would automatically recycle
the storage space dynamically released in this way.
Now I'm reading in this forum that, in fact, Postgres does not efficiently and automatically recycle the storage space released by row DELETEs.
* * *
My application is quite simple and will support a modest workload using a small amount of storage space, compared to the massive transaction rates and gigantic space usage I'm reading about in this forum.
I do have the luxury of being able to shut down the application for a few minutes periodically, e.g. every 24 hours.
My plan was always to avoid eventual exhaustion of the SERIAL sequence number integer series by swapping in a freshly truncated <eto_sql_tb_session_www> table during the periodic app shutdown.
Before going back online, the freshly truncated <eto_sql_tb_session_www> table will receive INSERTs of any active session rows copied over from the old <eto_sql_tb_session_www> table. The copied session rows will get new sequence numbers, but that won't matter, because a session row is referenced within each incoming HTTP request not by its serial number column <session_www_serno>, but by its randomly generated (MD5) 32-character unique key column <session_www_code>.
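Concretely, the swap I have in mind looks something like this (a sketch only; it assumes the default SERIAL sequence name, the _old suffix is just shorthand, and I'm again treating <session_www_expiry_utc> as the activity test):

-- run during the maintenance window, while the app is offline
BEGIN;
ALTER TABLE its_eto.eto_sql_tb_session_www
    RENAME TO eto_sql_tb_session_www_old;
CREATE TABLE its_eto.eto_sql_tb_session_www
    (LIKE its_eto.eto_sql_tb_session_www_old
     INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING INDEXES);
-- re-point the SERIAL sequence at the new table and restart it,
-- which is what avoids eventual exhaustion of the serial numbers
ALTER SEQUENCE its_eto.eto_sql_tb_session_www_session_www_serno_seq
    OWNED BY its_eto.eto_sql_tb_session_www.session_www_serno;
ALTER SEQUENCE its_eto.eto_sql_tb_session_www_session_www_serno_seq
    RESTART WITH 1;
-- copy still-active sessions; omitting session_www_serno lets the
-- SERIAL default hand out fresh sequence numbers
INSERT INTO its_eto.eto_sql_tb_session_www
    (session_www_code, session_www_type, session_www_state,
     session_verify_code, session_target_serno, session_target_data,
     session_www_init_utc, session_www_last_utc,
     session_www_expiry_utc, session_www_delete_utc,
     session_www_hit_count, session_www_act_seqno)
SELECT session_www_code, session_www_type, session_www_state,
       session_verify_code, session_target_serno, session_target_data,
       session_www_init_utc, session_www_last_utc,
       session_www_expiry_utc, session_www_delete_utc,
       session_www_hit_count, session_www_act_seqno
  FROM its_eto.eto_sql_tb_session_www_old
 WHERE session_www_expiry_utc >= now() AT TIME ZONE 'UTC';
DROP TABLE its_eto.eto_sql_tb_session_www_old;
COMMIT;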
So my question to this Postgres forum is -- should I just remove from the online app the "fuzzy" probability mechanism that DELETEs expired rows from the session table, because the Postgres server is not going to dynamically recycle the released storage space anyway?
Any comments appreciated.
Steve
----- Original Message -----
Sent: Wednesday, November 25, 2015 3:50 PM
Subject: Re: using a postgres table as a multi-writer multi-updater queue
I'll add my two cents.....
I set up something similar a while ago. Here are my suggestions for what
they are worth.
You don't ever want to delete from such a table so you need to set up
something which allows you to truncate the tables when you no longer need
them.
One way to accomplish this is with rules (not triggers; rules are blazingly fast compared to triggers). Set up a table inheritance scheme based on whatever search criteria you have (date, sequence, etc.). Set up a cron job to create the child tables well ahead of time and to set up the rule redirecting the inserts. For example, let's say you have a date-partitioned table and you want to keep a table for every day. Your cron job would run once a day and would create the next seven days' worth of tables (just in case the cron job fails to run for some reason) and would rewrite the rule to insert into the right table with if/then/else-style logic, as sketched below. This method is preferable to dynamic creation of the table name with string concatenation because, again, it's significantly faster.
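A minimal sketch of that kind of redirect, with a hypothetical parent table q partitioned by a created_at timestamp (the cron job would recreate the rule daily, one per live day):

-- hypothetical daily child table inheriting from the parent
CREATE TABLE q_2015_11_26 (
    CHECK (created_at >= DATE '2015-11-26'
       AND created_at <  DATE '2015-11-27')
) INHERITS (q);

-- conditional rule redirecting matching inserts to that child
CREATE OR REPLACE RULE q_insert_2015_11_26 AS
    ON INSERT TO q
    WHERE NEW.created_at >= DATE '2015-11-26'
      AND NEW.created_at <  DATE '2015-11-27'
    DO INSTEAD
    INSERT INTO q_2015_11_26 VALUES (NEW.*);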
Another method I tried was to have one "primary" child table and
"archival" child tables and insert directly into the primal child table.
For example say you have a table called "Q". You set up a table called
Q_in which inherits from Q. Your code inserts into the Q_in table, you
select from the Q table. On a periodic basis you do this
BEGIN TRANSACTION;
LOCK TABLE q_in IN EXCLUSIVE MODE;
ALTER TABLE q_in RENAME TO q_some_datetime;
CREATE TABLE q_in (LIKE q_some_datetime
    INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING INDEXES);
ALTER SEQUENCE q_in_id_seq OWNED BY q_in.id;
-- set some constraints so the query optimizer knows what to do
END TRANSACTION;
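That last "set some constraints" step would be something like the following, assuming a created_at column on the renamed table, so constraint exclusion can skip the archival child when queries against Q carry a date predicate:

-- hypothetical: bound the archived rows so the planner can
-- exclude this child from date-qualified scans of Q
ALTER TABLE q_some_datetime
    ADD CONSTRAINT q_some_datetime_created_at_check
    CHECK (created_at < TIMESTAMP '2015-11-26 00:00:00');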
There is one other method, which is the Kafka approach (you can use this in addition to the above methods). Create N tables for the incoming queue, each one with a sequence for its ID number. N should be determined by how many clients you expect to run. Create a rule which uses some hash function, round robin, or randomness to insert data into one of the tables. Create a different table which keeps track of client connections. The clients use this table to keep track of the last ID fetched. For example, say I have three types of processes that run on the incoming data: p1, p2, p3. The table logs the highest ID fetched from each queue table for each process. When a client connects, it connects to the table with the lowest used ID for that process and locks it for that client (not per process, because you can have multiple clients running each process); it processes the records, updates the ID, unlocks the table, and backs off for a few seconds. The next client that wakes up goes through the same steps, and so on; see the sketch after this paragraph. Both Apache Kafka and Amazon Kinesis use this approach. One nice thing about this approach is that you can put each table in its own tablespace on its own disk for higher performance.
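A sketch of the tracking table and the claim step, all names hypothetical (the row lock only lasts for the duration of the claiming transaction):

-- one row per (queue shard, process type) pair
CREATE TABLE q_offsets (
    shard_no int    NOT NULL,
    process  text   NOT NULL,   -- e.g. 'p1', 'p2', 'p3'
    last_id  bigint NOT NULL DEFAULT 0,
    PRIMARY KEY (shard_no, process)
);

-- a client claims the least-advanced shard for its process;
-- FOR UPDATE holds the row lock while the batch is handled
SELECT shard_no, last_id
  FROM q_offsets
 WHERE process = 'p1'
 ORDER BY last_id
 LIMIT 1
 FOR UPDATE;

-- after processing, record the new high-water mark
UPDATE q_offsets
   SET last_id = 12345   -- highest ID actually processed
 WHERE shard_no = 0 AND process = 'p1';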
One other thing. This is dangerous, but you can turn off logging of the tables; this will make the inserts much faster.
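That means declaring the queue tables UNLOGGED; a minimal sketch with hypothetical columns (the danger being that an unlogged table is emptied after a crash):

-- unlogged: no WAL written for these inserts, so much faster,
-- but the table's contents are truncated after a crash
CREATE UNLOGGED TABLE q_in (
    id         bigserial PRIMARY KEY,
    payload    text      NOT NULL,
    created_at timestamp NOT NULL DEFAULT now()
);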
Finally:
There is no need to do any of this. Kinesis is cheap, Kafka is
pretty awesome, Rabbit is crazy useful.
Hope this helps, ping me offline if you want more details.
Cheers.