Re: Advice on best way to store a large amount of data in postgresql

On Sun, Jan 08, 2023 at 07:02:01AM -0500, spiral wrote:
> This table is used essentially as a key-value store; rows are accessed
> only with `mid` primary key. Additionally, inserted rows may only be
> deleted, but never updated.
> 
> We only run the following queries:
> - INSERT INTO messages VALUES (...data...);
> - SELECT * FROM messages WHERE mid = $1;
> - DELETE FROM messages WHERE mid = $1;
> - DELETE FROM messages WHERE mid IN ($1...$n);
> - SELECT count(*) FROM messages;

Great - it's good to start with the queries to optimize.

Are you using the extended query protocol with "bind" parameters, or are the
values escaped and substituted by the client library?
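(For reference, server-side prepared statements are one way to get bound
parameters; a minimal sketch, assuming `mid` is a bigint snowflake -- the
statement name and value here are made up:)

```sql
-- The mid travels as a bound parameter, not interpolated into the SQL text,
-- so the plan can be prepared once and reused.
PREPARE get_message (bigint) AS
    SELECT * FROM messages WHERE mid = $1;

EXECUTE get_message(1234567890123456789);
```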

> So, the problem: I don't know how to best store this data in
> postgres, or what system requirements would be needed.
> Originally, this table did not contain a substantial amount of data,
> and so I stored it in the same database as our CRUD user data. However,
> as the table became larger, cache was being allocated to (mostly
> unused) historical data from the `messages` table, and I decided to
> move the large table to its own postgres instance.
> 
> At the same time, I partitioned the table, with TimescaleDB's automatic
> time-series partitioning, because our data is essentially time-series
> (`mid` values are Twitter-style snowflakes) and it was said that
> partitioning would improve performance.
> This ended up being a mistake... shared_buffers memory usage went way
> up, from the 20GB of the previous combined database to 28GB for just
> the messages database, and trying to lower shared_buffers at all made
> the database start throwing "out of shared memory" errors when running
> DELETE queries. A TimescaleDB update did improve this, but 28GB is way
> more memory than I can afford to allocate to this database - instead of
> "out of shared memory", it gets OOM killed by the system.

Can you avoid using DELETE and instead use DROP?  I mean, can you
arrange your partitioning such that the rows to be removed all fall in
one partition, so they can be dropped in bulk?  That's one of the main
reasons for using partitioning.
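(A sketch of what I mean, using TimescaleDB's 2.x retention call -- the
90-day interval and partition name below are invented examples, not
recommendations:)

```sql
-- Drop every chunk older than 90 days in one metadata operation:
-- no per-row DELETE, no bloat, far fewer locks.
SELECT drop_chunks('messages', INTERVAL '90 days');

-- Plain-postgres equivalent with native partitioning: detach and drop
-- the partition covering the expired mid range.
-- ALTER TABLE messages DETACH PARTITION messages_2022_q4;
-- DROP TABLE messages_2022_q4;
```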

(Or, as a worse option, if you do need DELETE, can you change the query
to DELETE a small batch of MIDs at a time, and loop over the batches?)
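(Something like the following on the client side -- a sketch, not your
code; the batch size of 100 and the psycopg-style execute shown in the
comment are assumptions.  Each small batch touches fewer chunks, so each
statement takes fewer locks:)

```python
def chunked(mids, size=100):
    """Split a list of mids into fixed-size batches, so that each
    DELETE statement touches fewer partitions and holds fewer locks."""
    for i in range(0, len(mids), size):
        yield mids[i:i + size]

# Each batch would become one parameterized statement, e.g.:
#   cur.execute("DELETE FROM messages WHERE mid = ANY(%s)", (batch,))
for batch in chunked([101, 102, 103, 104, 105], size=2):
    print(batch)
```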

What version of postgres is it?  Ah, I found that you reported the same thing
in at least one other place.  (It'd be useful to include that information
here, along with a link to the prior discussion with the other
product/vendor.)

https://github.com/timescale/timescaledb/issues/5075

In this other issue report, you said that you increased
max_locks_per_transaction.  I suppose you need to increase it further,
or reduce the number of chunks (by increasing your chunk size).  How
many "partitions" do you have (actually, timescale uses inheritance)?
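(To check both numbers -- the chunk-count query assumes the TimescaleDB
2.x information view, and 256 is an arbitrary example value, not a
recommendation.  Note that changing max_locks_per_transaction requires a
server restart:)

```sql
-- How many chunks does the hypertable currently have?
SELECT count(*)
FROM timescaledb_information.chunks
WHERE hypertable_name = 'messages';

-- Enlarge the lock table; takes effect only after a restart.
ALTER SYSTEM SET max_locks_per_transaction = 256;
```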

-- 
Justin




