(Kindly forgive my top-post. My stupid email client software (Microsoft Outlook Express on Win XP) refuses to respect its own "Reply" option settings for inline reply text. I've been looking for a replacement email client, but so far without success.)
* * *
> You don't ever want to delete from such a table so you need to set up
> something which allows you to truncate the tables when you no longer
> need them.
I am migrating a PHP web application (called ITS-ETO) from MySQL to Postgres. The app INSERTs a row into a Postgres table <eto_sql_tb_session_www> to manage each session with a web browser. Here is the DDL for the session table:
CREATE TABLE its_eto.eto_sql_tb_session_www (
    session_www_code        char(32)      NOT NULL UNIQUE PRIMARY KEY,
    session_www_type        int           NOT NULL,
    session_www_state       int           NOT NULL,
    session_verify_code     char(7)       NOT NULL,
    session_www_serno       SERIAL        NOT NULL UNIQUE,
    session_target_serno    int           NULL,
    session_target_data     varchar(1000) NULL,
    session_www_init_utc    timestamp     NOT NULL,
    session_www_last_utc    timestamp     NOT NULL,
    session_www_expiry_utc  timestamp     NOT NULL,
    session_www_delete_utc  timestamp     NOT NULL,
    session_www_hit_count   int           NOT NULL,
    session_www_act_seqno   int           NULL
);

CREATE INDEX ON its_eto.eto_sql_tb_session_www (session_www_serno);
Using a "fuzzy" probability mechanism, some
randomly-selected fraction of the HTTP requests that initiate a
new session, also SELECT and DELETE expired rows from the session
table. I naively assumed that the database server would automatically recycle
the storage space dynamically released in this way.
Now I'm reading in this forum that, in fact, Postgres does not efficiently and automatically recycle the storage space released by row DELETEs.
* * *
My application is quite simple and will support a modest workload using a small amount of storage space, compared to the massive transaction rates and gigantic space usage I'm reading about in this forum.
I do have the luxury of being able to shut down the application for a few minutes periodically, e.g. every 24 hours.
My plan was always to avoid eventual exhaustion of the SERIAL sequence number integer series by swapping in a freshly truncated <eto_sql_tb_session_www> table during the periodic app shutdown.
Before going back online, the freshly truncated <eto_sql_tb_session_www> table will receive INSERTs of any active session rows copied over from the old <eto_sql_tb_session_www> table. The copied session rows will get new sequence numbers, but that won't matter, because a session row is referenced within each incoming HTTP request not by its serial number column <session_www_serno>, but by its randomly generated (MD5) 32-character unique key column <session_www_code>.
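Concretely, the swap I have in mind looks something like this (a sketch only; it assumes the default SERIAL sequence name, the _old suffix is just shorthand, and I'm again treating <session_www_expiry_utc> as the activity test):

-- run during the maintenance window, while the app is offline
BEGIN;
ALTER TABLE its_eto.eto_sql_tb_session_www
    RENAME TO eto_sql_tb_session_www_old;
CREATE TABLE its_eto.eto_sql_tb_session_www
    (LIKE its_eto.eto_sql_tb_session_www_old
     INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING INDEXES);
-- re-point the SERIAL sequence at the new table and restart it,
-- which is what avoids eventual exhaustion of the serial numbers
ALTER SEQUENCE its_eto.eto_sql_tb_session_www_session_www_serno_seq
    OWNED BY its_eto.eto_sql_tb_session_www.session_www_serno;
ALTER SEQUENCE its_eto.eto_sql_tb_session_www_session_www_serno_seq
    RESTART WITH 1;
-- copy still-active sessions; omitting session_www_serno lets the
-- SERIAL default hand out fresh sequence numbers
INSERT INTO its_eto.eto_sql_tb_session_www
    (session_www_code, session_www_type, session_www_state,
     session_verify_code, session_target_serno, session_target_data,
     session_www_init_utc, session_www_last_utc,
     session_www_expiry_utc, session_www_delete_utc,
     session_www_hit_count, session_www_act_seqno)
SELECT session_www_code, session_www_type, session_www_state,
       session_verify_code, session_target_serno, session_target_data,
       session_www_init_utc, session_www_last_utc,
       session_www_expiry_utc, session_www_delete_utc,
       session_www_hit_count, session_www_act_seqno
  FROM its_eto.eto_sql_tb_session_www_old
 WHERE session_www_expiry_utc >= now() AT TIME ZONE 'UTC';
DROP TABLE its_eto.eto_sql_tb_session_www_old;
COMMIT;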
So my question to this Postgres forum is -- should I just remove from the online app the "fuzzy" probability mechanism that DELETEs expired rows from the session table, because the Postgres server is not going to dynamically recycle the released storage space anyway?
Any comments appreciated.
Steve
----- Original Message -----
Sent: Wednesday, November 25, 2015 3:50 PM
Subject: Re: using a postgres table as a multi-writer multi-updater queue
I'll add my two cents.....
I set up something similar a while ago. Here are my suggestions for what
they are worth.
You don't ever want to delete from such a table so you need to set up
something which allows you to truncate the tables when you no longer need
them.
One way to accomplish this is with rules (not triggers; rules are blazingly fast compared to triggers). Set up a table inheritance scheme based on whatever search criteria you have (date, sequence, etc.). Set up a cron job to create the child tables well ahead of time and to set up the rule redirecting the inserts. For example, let's say you have a date-partitioned table and you want to keep a table for every day. Your cron job would run once a day and would create the next seven days' worth of tables (just in case the cron job fails to run for some reason) and would rewrite the rule to insert into the right table with if/then/else-style logic, as sketched below. This method is preferable to dynamic creation of the table name with string concatenation because, again, it's significantly faster.
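A minimal sketch of that kind of redirect, with a hypothetical parent table q partitioned by a created_at timestamp (the cron job would recreate the rule daily, one per live day):

-- hypothetical daily child table inheriting from the parent
CREATE TABLE q_2015_11_26 (
    CHECK (created_at >= DATE '2015-11-26'
       AND created_at <  DATE '2015-11-27')
) INHERITS (q);

-- conditional rule redirecting matching inserts to that child
CREATE OR REPLACE RULE q_insert_2015_11_26 AS
    ON INSERT TO q
    WHERE NEW.created_at >= DATE '2015-11-26'
      AND NEW.created_at <  DATE '2015-11-27'
    DO INSTEAD
    INSERT INTO q_2015_11_26 VALUES (NEW.*);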
Another method I tried was to have one "primary" child table and
"archival" child tables and insert directly into the primal child table.
For example say you have a table called "Q". You set up a table called
Q_in which inherits from Q. Your code inserts into the Q_in table, you
select from the Q table. On a periodic basis you do this
BEGIN TRANSACTION;
LOCK TABLE q_in IN EXCLUSIVE MODE;
ALTER TABLE q_in RENAME TO q_some_datetime;
CREATE TABLE q_in (LIKE q_some_datetime
    INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING INDEXES);
ALTER SEQUENCE q_in_id_seq OWNED BY q_in.id;
-- set some constraints so the query optimizer knows what to do
END TRANSACTION;
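That last "set some constraints" step would be something like the following, assuming a created_at column on the renamed table, so constraint exclusion can skip the archival child when queries against Q carry a date predicate:

-- hypothetical: bound the archived rows so the planner can
-- exclude this child from date-qualified scans of Q
ALTER TABLE q_some_datetime
    ADD CONSTRAINT q_some_datetime_created_at_check
    CHECK (created_at < TIMESTAMP '2015-11-26 00:00:00');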
There is one other method, which is the Kafka approach (you can use this in addition to the above methods). Create N tables for the incoming queue, each one with a sequence for its ID number. N should be determined by how many clients you expect to run. Create a rule which uses some hash function, round robin, or randomness to insert data into one of the tables. Create a different table which keeps track of client connections. The clients use this table to keep track of the last ID fetched. For example, say I have three types of processes that run on the incoming data: p1, p2, p3. The table logs the highest ID fetched from each queue table for each process. When a client connects, it connects to the table with the lowest used ID for that process and locks it for that client (not per process, because you can have multiple clients running each process); it processes the records, updates the ID, unlocks the table, and backs off for a few seconds. The next client that wakes up goes through the same steps, and so on; see the sketch after this paragraph. Both Apache Kafka and Amazon Kinesis use this approach. One nice thing about this approach is that you can put each table in its own tablespace on its own disk for higher performance.
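A sketch of the tracking table and the claim step, all names hypothetical (the row lock only lasts for the duration of the claiming transaction):

-- one row per (queue shard, process type) pair
CREATE TABLE q_offsets (
    shard_no int    NOT NULL,
    process  text   NOT NULL,   -- e.g. 'p1', 'p2', 'p3'
    last_id  bigint NOT NULL DEFAULT 0,
    PRIMARY KEY (shard_no, process)
);

-- a client claims the least-advanced shard for its process;
-- FOR UPDATE holds the row lock while the batch is handled
SELECT shard_no, last_id
  FROM q_offsets
 WHERE process = 'p1'
 ORDER BY last_id
 LIMIT 1
 FOR UPDATE;

-- after processing, record the new high-water mark
UPDATE q_offsets
   SET last_id = 12345   -- highest ID actually processed
 WHERE shard_no = 0 AND process = 'p1';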
One other thing. This is dangerous, but you can turn off logging of the tables; this will make the inserts much faster.
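That means declaring the queue tables UNLOGGED; a minimal sketch with hypothetical columns (the danger being that an unlogged table is emptied after a crash):

-- unlogged: no WAL written for these inserts, so much faster,
-- but the table's contents are truncated after a crash
CREATE UNLOGGED TABLE q_in (
    id         bigserial PRIMARY KEY,
    payload    text      NOT NULL,
    created_at timestamp NOT NULL DEFAULT now()
);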
Finally:
There is no need to do any of this. Kinesis is cheap, Kafka is
pretty awesome, Rabbit is crazy useful.
Hope this helps, ping me offline if you want more details.
Cheers.