Re: using a postgres table as a multi-writer multi-updater queue

(Kindly forgive my top-post. My stupid email client software (Microsoft Outlook Express on Win XP) refuses to respect its own "Reply" option settings for inline reply text. I've been looking for a replacement email client but so far without success.)
 
* * *
* * *
 
> You don't ever want to delete from such a table so you need to set up something which allows you to truncate the tables when you no longer need them.
 
I am migrating a web PHP application (called ITS-ETO) from mysql to postgres. The app INSERTs a row into a postgres table <eto_sql_tb_session_www> to manage each session with a web browser. Here is the DDL for the session table:
CREATE TABLE its_eto.eto_sql_tb_session_www
(
  session_www_code       char(32)      NOT NULL UNIQUE PRIMARY KEY,
 
  session_www_type       int           NOT NULL,
  session_www_state      int           NOT NULL,
  session_verify_code    char(7)       NOT NULL,
 
  session_www_serno      SERIAL        NOT NULL UNIQUE,
 
  session_target_serno   int           NULL,
  session_target_data    varchar(1000) NULL,
 
  session_www_init_utc   timestamp     NOT NULL,
  session_www_last_utc   timestamp     NOT NULL,
  session_www_expiry_utc timestamp     NOT NULL,
  session_www_delete_utc timestamp     NOT NULL,
  session_www_hit_count  int           NOT NULL,
  session_www_act_seqno  int           NULL
 
);
 
CREATE INDEX ON its_eto.eto_sql_tb_session_www (session_www_serno);
Using a "fuzzy" probability mechanism, some randomly-selected fraction of the HTTP requests that initiate a new session, also SELECT and DELETE expired rows from the session table. I naively assumed that the database server would automatically recycle the storage space dynamically released in this way.
 
Now I'm reading in this forum that, in fact, postgres does not efficiently and automatically recycle the storage space released by row DELETEs.
 
* * *
* * *
 
My application is quite simple and will support a modest workload, using a small amount of storage space, compared to the massive transaction rates and gigantic space usages I'm reading about in this forum.
 
I do have the luxury of being able to shut down the application for a few minutes periodically, e.g. every 24 hours.
 
My plan has always been to avoid eventual exhaustion of the SERIAL sequence number's integer range by swapping in a freshly truncated postgres <eto_sql_tb_session_www> table during the periodic app shutdown.
 
Before going back online, the freshly truncated postgres <eto_sql_tb_session_www> table will receive INSERTs of any still-active session rows copied over from the old <eto_sql_tb_session_www> table.  The copied session rows will get new sequence numbers, but that won't matter, because a session row is referenced within each incoming HTTP request not by its row serial number column <session_www_serno> integer value, but by a randomly generated (MD5) 32-character unique key column <session_www_code> value.
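
During the shutdown window, the swap could look something like this sketch (the _old / _new table names are hypothetical, and an empty replacement table its_eto.eto_sql_tb_session_www_new is assumed to have been created beforehand with the same DDL; session_www_serno is deliberately left out of the copy so the fresh table's SERIAL assigns new values):

BEGIN;
ALTER TABLE its_eto.eto_sql_tb_session_www     RENAME TO eto_sql_tb_session_www_old;
ALTER TABLE its_eto.eto_sql_tb_session_www_new RENAME TO eto_sql_tb_session_www;

-- copy over only the still-active sessions
INSERT INTO its_eto.eto_sql_tb_session_www
      (session_www_code, session_www_type, session_www_state, session_verify_code,
       session_target_serno, session_target_data,
       session_www_init_utc, session_www_last_utc, session_www_expiry_utc,
       session_www_delete_utc, session_www_hit_count, session_www_act_seqno)
SELECT session_www_code, session_www_type, session_www_state, session_verify_code,
       session_target_serno, session_target_data,
       session_www_init_utc, session_www_last_utc, session_www_expiry_utc,
       session_www_delete_utc, session_www_hit_count, session_www_act_seqno
  FROM its_eto.eto_sql_tb_session_www_old
 WHERE session_www_expiry_utc >= (now() AT TIME ZONE 'UTC');
COMMIT;

DROP TABLE its_eto.eto_sql_tb_session_www_old;  -- or TRUNCATE it and keep it for the next swap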
 
So my question to this postgres forum is: should I just remove from the online app the "fuzzy" probability mechanism that DELETEs expired rows from the session table, since the postgres server is not going to dynamically recycle the released storage space anyway?
 
Any comments appreciated.
 
Steve
 
----- Original Message -----
From: Tim Uckun
Sent: Wednesday, November 25, 2015 3:50 PM
Subject: Re: using a postgres table as a multi-writer multi-updater queue

I'll add my two cents.....

I set up something similar a while ago. Here are my suggestions for what they are worth.

You don't ever want to delete from such a table so you need to set up something which allows you to truncate the tables when you no longer need them.  

One way to accomplish this is with rules (not triggers; rules are blazingly fast compared to triggers).  Set up a table inheritance scheme based on whatever search criteria you have (date, sequence, etc.).  Set up a cron job to create the child tables well ahead of time and to set up the rule redirecting the inserts.  For example, let's say you have a date-partitioned table and you want to keep a table for every day.  Your cron job would run once a day, create the next seven days' worth of tables (just in case the cron job fails to run for some reason), and rewrite the rule to insert into the right table with if-then-else type logic.  This method is preferable to dynamically building the table name with string concatenation because, again, it's significantly faster.
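
A minimal sketch of that redirect rule, assuming a hypothetical parent table q with a created_at date column and one child table per day:

CREATE TABLE q_2015_11_26
    (CHECK (created_at >= DATE '2015-11-26' AND created_at < DATE '2015-11-27'))
    INHERITS (q);

CREATE OR REPLACE RULE q_insert_2015_11_26 AS
    ON INSERT TO q
    WHERE NEW.created_at >= DATE '2015-11-26' AND NEW.created_at < DATE '2015-11-27'
    DO INSTEAD
    INSERT INTO q_2015_11_26 VALUES (NEW.*);

The cron job's only role is to keep creating the next few children and re-issuing the matching rules ahead of time.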

Another method I tried was to have one "primary" child table and "archival" child tables, and insert directly into the primary child table.  For example, say you have a table called "Q".  You set up a table called Q_in which inherits from Q.  Your code inserts into the Q_in table, and you select from the Q table.  On a periodic basis you do this:

BEGIN TRANSACTION;
LOCK TABLE Q_IN IN EXCLUSIVE MODE;
ALTER TABLE Q_IN RENAME TO Q_SOME_DATETIME;
CREATE TABLE Q_IN
  (LIKE Q_SOME_DATETIME INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING INDEXES)
  INHERITS (Q);  -- LIKE does not copy inheritance, so re-attach the new table to the parent Q
ALTER SEQUENCE q_in_id_seq OWNED BY q_in.id;  -- the shared id sequence keeps counting across swaps
-- Set some constraints so the query optimizer knows what to do
END TRANSACTION;


There is one other method, which is the Kafka approach (you can use this in addition to the above methods).

Create N tables for the incoming queue, each one with its own sequence for its ID number.  N should be determined by how many clients you expect to run.  Create a rule which uses some hash function, round robin, or random choice to insert data into one of the tables.  Create a different table which keeps track of client connections.  The clients use this table to keep track of the last ID fetched.  For example, let's say I have three types of processes that run on the incoming data: p1, p2, p3.  The table logs the highest ID fetched from each queue table for each process.  When a client connects, it connects to the queue table with the lowest used ID for that process, and it locks it for that client (not process, because you can have multiple clients running each process); it processes the records, updates the ID, unlocks the table, and backs off for a few seconds.  The next client which wakes up goes through the same steps, and so on.  Both Apache Kafka and Amazon Kinesis use this approach.  One nice thing about this approach is that you can put each table in its own tablespace on its own disk for higher performance.
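
A rough sketch of what that bookkeeping table and the "claim the least-advanced table" step could look like (the table and column names here are made up for illustration):

CREATE TABLE q_client_state (
    process_name text   NOT NULL,           -- p1, p2, p3 ...
    queue_table  text   NOT NULL,           -- which of the N incoming tables
    last_id      bigint NOT NULL DEFAULT 0, -- highest ID already processed
    locked_by    text,                      -- NULL when no client holds this table
    PRIMARY KEY (process_name, queue_table)
);

-- a client running process p1 claims the least-advanced unlocked queue table
UPDATE q_client_state
   SET locked_by = 'worker-42'
 WHERE process_name = 'p1'
   AND locked_by IS NULL
   AND queue_table = (SELECT queue_table
                        FROM q_client_state
                       WHERE process_name = 'p1' AND locked_by IS NULL
                       ORDER BY last_id
                       LIMIT 1)
RETURNING queue_table, last_id;

After draining the claimed table up to some new ID, the client updates last_id and sets locked_by back to NULL.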

One other thing.  This is dangerous, but you can turn off logging of the tables; this will make the inserts much faster.
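
In Postgres terms that usually means an unlogged table; a sketch, reusing the q_in name from the example above:

-- unlogged tables skip WAL, so inserts are much faster, but the table is
-- emptied after a crash and its contents are not replicated
CREATE UNLOGGED TABLE q_in_fast (LIKE q_in INCLUDING ALL);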

Finally:

There is no need to do any of this.  Kinesis is cheap, Kafka is pretty awesome, Rabbit is crazy useful.  

Hope this helps, ping me offline if you want more details.

Cheers.


On Tue, Nov 24, 2015 at 11:51 AM, Merlin Moncure <mmoncure@xxxxxxxxx> wrote:
On Mon, Nov 23, 2015 at 12:54 PM, John R Pierce <pierce@xxxxxxxxxxxx> wrote:
> On 11/23/2015 2:41 AM, Chris Withers wrote:
>>
>>
>> If it's totally wrong, how should I be looking to approach the problem?
>
> depending on where these queue entries are coming from, I'd consider
> using a message queueing system like AMS, MQseries, etc., rather than trying
> to use a relational database table as a queue.  Your external data source(s)
> would write messages to this queue, and you'd have 'subscriber' processes
> that listen to the queue and process the messages, inserting persistent data
> into the database as needed.

I just don't agree with this generalization.  Keeping the state of the
queue in the database has a lot of advantages and is a lot easier to
deal with from a programming perspective especially if SQL is your
core competency.  Being able to produce and consume in SQL based on
other relational datasources is...elegant.

Specialized queue systems are a very heavy dependency, and adding a new
server to your platform to manage queues is not something to take
lightly.  This advice also applies to scheduling systems like Quartz,
specialized search like Solr and Elasticsearch, and distributed data
platforms like Hadoop.  I've used all of these things and have tended
to wish I had just used the database instead in just about every case.

Also, check out pgq (https://wiki.postgresql.org/wiki/PGQ_Tutorial).
Personally, I tend to roll my own queues.   It's not difficult.
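
A minimal sketch of what a hand-rolled queue can look like (illustrative only; the table and column names are made up, not Merlin's actual implementation):

CREATE TABLE job_queue (
    id         bigserial PRIMARY KEY,
    payload    text      NOT NULL,
    claimed_at timestamp              -- NULL = still waiting to be processed
);

-- a worker claims the oldest unclaimed job; if a concurrent worker grabbed it
-- first, this returns no row and the caller simply retries
UPDATE job_queue
   SET claimed_at = now()
 WHERE id = (SELECT id
               FROM job_queue
              WHERE claimed_at IS NULL
              ORDER BY id
              LIMIT 1
                FOR UPDATE)
RETURNING id, payload;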

merlin

