Re: Advice on best way to store a large amount of data in postgresql

That’s crazy, only having 8GB of memory when you have tables over 100GB. One general rule of thumb is to have enough memory to hold the biggest index.
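
For reference, a quick way to see how big that index actually is (a
minimal sketch run against the messages database, using the table and
index names from the quoted message):

    -- report the size of every index on the messages table
    SELECT indexrelname,
           pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
    FROM pg_stat_user_indexes
    WHERE relname = 'messages';

With only the primary key present, that one number is roughly the
amount of memory the rule of thumb asks for.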

Sent from my iPad

> On Jan 9, 2023, at 3:23 AM, spiral <spiral@xxxxxxxxx> wrote:
> 
> Hello,
> 
> We have a table containing ~1.75 billion rows, using 170GB storage.
> The table schema is the following:
> 
> messages=# \d messages
>                 Table "public.messages"
>     Column    |  Type   | Collation | Nullable | Default 
> --------------+---------+-----------+----------+---------
>  mid          | bigint  |           | not null | 
>  channel      | bigint  |           | not null | 
>  member       | integer |           |          | 
>  sender       | bigint  |           | not null | 
>  original_mid | bigint  |           |          | 
>  guild        | bigint  |           |          | 
> Indexes:
>     "messages_pkey" PRIMARY KEY, btree (mid)
> 
> 
> This table is used essentially as a key-value store; rows are accessed
> only by the `mid` primary key. Additionally, rows may be deleted after
> insertion, but never updated.
> 
> We only run the following queries:
> - INSERT INTO messages VALUES (...data...);
> - SELECT * FROM messages WHERE mid = $1;
> - DELETE FROM messages WHERE mid = $1;
> - DELETE FROM messages WHERE mid IN ($1...$n);
> - SELECT count(*) FROM messages;
> 
> The "IN" query can have up to 100 parameters, and it is possible that
> none of them match an existing row.
> 
> So, the problem: I don't know how to best store this data in
> postgres, or what system requirements would be needed.
> Originally, this table did not contain a substantial amount of data,
> and so I stored it in the same database as our CRUD user data. However,
> as the table became larger, cache was being allocated to (mostly
> unused) historical data from the `messages` table, and I decided to
> move the large table to its own postgres instance.
> 
> At the same time, I partitioned the table, with TimescaleDB's automatic
> time-series partitioning, because our data is essentially time-series
> (`mid` values are Twitter-style snowflakes) and it was said that
> partitioning would improve performance.
> This ended up being a mistake... shared_buffers memory usage went way
> up, from the 20GB of the previous combined database to 28GB for just
> the messages database, and trying to lower shared_buffers at all made
> the database start throwing "out of shared memory" errors when running
> DELETE queries. A TimescaleDB update did improve this, but 28GB is way
> more memory than I can afford to allocate to this database - instead of
> "out of shared memory", it gets OOM killed by the system.
> 
> What is the best course of action here?
> - Ideally, I would like to host this database on a machine with 4
>  (Ryzen) cores, 8GB RAM, and tiered storage (our cloud provider doesn't
>  support adding more local storage to a VPS plan). Of course, this
>  seems very unrealistic, so it's not a requirement, but the closer we
>  can get to this, the better.
> - Is it a good idea to use table partitioning? I've heard advice that
>  tables with more than a couple million rows should be partitioned,
>  but I don't know how true that is. We have a table with ~6mil rows in
>  our main database that has somewhat slow lookups, but we also have a
>  table with ~13mil rows that has fast lookups, so I'm not sure.
> 
> Thanks
> spiral
> 
> 
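
On the partitioning question in the quoted message: since the mid
values are snowflakes (so they increase with time), one alternative to
TimescaleDB worth considering is plain declarative range partitioning
on mid itself. A minimal sketch, with placeholder partition boundaries
rather than real snowflake timestamps:

    -- Range-partition on the snowflake primary key; each partition
    -- then covers a contiguous range of mid values, i.e. a span of time.
    CREATE TABLE messages (
        mid          bigint  NOT NULL,
        channel      bigint  NOT NULL,
        member       integer,
        sender       bigint  NOT NULL,
        original_mid bigint,
        guild        bigint,
        PRIMARY KEY (mid)
    ) PARTITION BY RANGE (mid);

    -- Placeholder boundaries; real ones would be derived from the
    -- timestamp encoded in the upper bits of the snowflake.
    CREATE TABLE messages_p0 PARTITION OF messages
        FOR VALUES FROM (MINVALUE) TO (1000000000000000000);
    CREATE TABLE messages_p1 PARTITION OF messages
        FOR VALUES FROM (1000000000000000000) TO (MAXVALUE);

Lookups and deletes by mid then prune to a single partition, and old
partitions can be detached or dropped wholesale.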
