Re: using a postgres table as a multi-writer multi-updater queue

I'll add my two cents.....

I set up something similar a while ago. Here are my suggestions for what they are worth.

You don't ever want to delete from such a table, so you need to set up something that allows you to truncate the tables when you no longer need them.

One way to accomplish this is with rules (not triggers; rules are blazingly fast compared to triggers). Set up a table inheritance scheme based on whatever search criteria you have (date, sequence, etc.). Set up a cron job to create the child tables well ahead of time and to set up the rules redirecting the inserts. For example, say you have a date-partitioned table and you want to keep a table for every day. Your cron job would run once a day and would create the next seven days' worth of tables (just in case the cron job fails to run for some reason) and would rewrite the rules so inserts are routed to the right child with if/then/else style logic; a sketch of what the cron job might create is below. This method is preferable to building the table name dynamically with string concatenation because, again, it's significantly faster.
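
To make that concrete, here's a rough sketch of what the cron job might create for one day. The parent table q and its created_at column are made-up names, and the if/then/else effect comes from having one conditional rule per child:

CREATE TABLE q_2015_11_24 (
    CHECK (created_at >= DATE '2015-11-24' AND created_at < DATE '2015-11-25')
) INHERITS (q);

CREATE OR REPLACE RULE q_insert_2015_11_24 AS
    ON INSERT TO q
    WHERE new.created_at >= DATE '2015-11-24'
      AND new.created_at <  DATE '2015-11-25'
    DO INSTEAD
    INSERT INTO q_2015_11_24 VALUES (new.*);

-- and when a day's data is no longer needed, drop (or truncate) its child table
DROP TABLE q_2015_11_17;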

Another method I tried was to have one "primary" child table and "archival" child tables, and insert directly into the primary child table.  For example, say you have a table called "Q".  You set up a table called Q_in which inherits from Q.  Your code inserts into the Q_in table, and you select from the Q table.  On a periodic basis you do this:

BEGIN TRANSACTION;
LOCK TABLE Q_IN IN EXCLUSIVE MODE;
ALTER TABLE Q_IN RENAME TO Q_SOME_DATETIME;
CREATE TABLE Q_IN
  (LIKE Q_SOME_DATETIME INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING INDEXES)
  INHERITS (Q);  -- keep the fresh Q_IN in the inheritance tree so SELECTs on Q still see it
ALTER SEQUENCE q_in_id_seq OWNED BY q_in.id;
-- Set some constraints so the query optimizer knows what to do
END TRANSACTION;
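
The "set some constraints" step might look like this, assuming the queue rows carry a created_at timestamp (the column name and bounds are illustrative). With constraint_exclusion on, the planner can then skip the archived child when queries on Q filter by date:

ALTER TABLE Q_SOME_DATETIME
  ADD CONSTRAINT q_some_datetime_created_at_check
  CHECK (created_at >= '2015-11-24 00:00' AND created_at < '2015-11-25 00:00');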


There is one other method, which is the Kafka approach (you can use this in addition to the above methods).

Create N tables for the incoming queue, each with its own sequence for its ID numbers. N should be determined by how many clients you expect to run.  Create a rule which uses a hash function, round robin, or random choice to insert data into one of the tables.  Create a separate table which keeps track of client connections; the clients use this table to keep track of the last ID fetched.  For example, say I have three types of processes that run on the incoming data: p1, p2, p3.  The table logs the highest ID fetched from each queue table for each process.  When a client connects, it picks the queue table with the lowest used ID for that process and locks it for that client (not for the process, because you can have multiple clients running each process), processes the records, updates the ID, unlocks the table, and backs off for a few seconds.  The next client that wakes up goes through the same steps, and so on; a rough sketch is below.  Both Apache Kafka and Amazon Kinesis use this approach.  One nice thing about it is that you can put each table in its own tablespace on its own disk for higher performance.
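
Here's a minimal sketch of the bookkeeping, with made-up names (shards q_0 and q_1, processes p1..p3, a client called client_42); it's meant to illustrate the idea, not be a complete implementation:

CREATE TABLE q_0 (id bigserial PRIMARY KEY, payload text);
CREATE TABLE q_1 (id bigserial PRIMARY KEY, payload text);

CREATE TABLE q_offsets (
    shard     int    NOT NULL,             -- which queue table (0 -> q_0, 1 -> q_1, ...)
    process   text   NOT NULL,             -- 'p1', 'p2', 'p3'
    last_id   bigint NOT NULL DEFAULT 0,   -- highest id already processed
    locked_by text,                        -- client currently working the shard, NULL if free
    PRIMARY KEY (shard, process)
);

-- a p1 client claims the least-advanced unlocked shard...
UPDATE q_offsets
   SET locked_by = 'client_42'
 WHERE (shard, process) IN (SELECT shard, process
                              FROM q_offsets
                             WHERE process = 'p1' AND locked_by IS NULL
                             ORDER BY last_id
                             LIMIT 1
                               FOR UPDATE)
RETURNING shard, last_id;

-- ...reads everything past last_id from that shard (say it got shard 0, last_id 1000)...
SELECT * FROM q_0 WHERE id > 1000 ORDER BY id;

-- ...then records its progress and releases the shard
UPDATE q_offsets
   SET last_id = 1500, locked_by = NULL
 WHERE shard = 0 AND process = 'p1';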

One other thing: this is dangerous, but you can turn off WAL logging for these tables by making them UNLOGGED, which will make the inserts much faster.  The downside is that an unlogged table comes back empty after a crash and isn't replicated.
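
For example, the incoming table from the rename trick above could be created unlogged (made-up names again):

CREATE UNLOGGED TABLE q_in
  (LIKE q INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING INDEXES);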

Finally:

There is no need to do any of this.  Kinesis is cheap, Kafka is pretty awesome, Rabbit is crazy useful.  

Hope this helps, ping me offline if you want more details.

Cheers.


On Tue, Nov 24, 2015 at 11:51 AM, Merlin Moncure <mmoncure@xxxxxxxxx> wrote:
On Mon, Nov 23, 2015 at 12:54 PM, John R Pierce <pierce@xxxxxxxxxxxx> wrote:
> On 11/23/2015 2:41 AM, Chris Withers wrote:
>>
>>
>> If it's totally wrong, how should I be looking to approach the problem?
>
> depending on where these queue entries are coming from, I'd consider
> using a message queueing system like AMS, MQseries, etc, rather than trying
> to use a relational database table as a queue. your external data source(s)
> would write messages to this queue, and you'd have 'subscriber' processes
> that listen to the queue and process the messages, inserting persistent data
> into the database as needed.

I just don't agree with this generalization.  Keeping the state of the
queue in the database has a lot of advantages and is a lot easier to
deal with from a programming perspective especially if SQL is your
core competency.  Being able to produce and consume in SQL based on
other relational datasources is...elegant.

Specialized queue systems are a very heavy dependency and adding a new
server to your platform to manage queues is not something to take
lightly.  This advice also applies to scheduling systems like quartz,
specialized search like solr and elasticsearch, and distributed data
platforms like hadoop.  I've used all of these things and have tended
to wish I had just used the database instead in just about every case.

Also, check out pgq (https://wiki.postgresql.org/wiki/PGQ_Tutorial).
Personally, I tend to roll my own queues.   It's not difficult.

merlin

