A couple of months ago I was tasked with setting up a new database to hold the results of some new data processing scripts a colleague of mine was developing. As I knew this would be a decent amount of data, I did my best to set up a system that would perform well, assigning the machine 20 processor cores to allow for parallel workflows when retrieving/processing the data, 128GB of RAM, and 6TB of NVMe storage. On this I installed the latest PostgreSQL (13), did some performance tuning of the settings according to various guides I found online, set up a table to hold the data, and created a number of indexes that seemed appropriate for the queries I anticipated.

At this point, performance is fine, though not spectacular, with a “normal” query of the currently 400GB database taking a couple of seconds. “Normal” in this case means the most common SELECT query run against the data to plot current and historical trends. Users are few in number (and will likely remain that way), and the system keeps up with the insert rate without difficulty - data is processed and dumped to the database in 10-minute “chunks”, and each run generally takes only around 2 minutes, so there is a fair amount of headroom there.

However, this database is growing at a rate of 14GB/day (as measured by watching df), and the script owner is talking about wanting to process and pull in “all the historical data we have access to”, which would go back several years, not to mention the probable desire to keep things running into the foreseeable future. This amounts to a volume of data that is WAY beyond anything I have experience with, especially since I am primarily a software engineer, not a DB admin (though administering a DB is often part of my role, as the software I engineer usually needs a data source). As such, I am looking for advice from people who have dealt with such large volumes of data on how I should architect things now, before it gets too out of hand, to best ensure optimal performance in the future.
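For reference, the “normal” plotting query mentioned above has roughly the following shape. This is a simplified, hand-written sketch against the data table shown at the bottom of this message, not the exact query the plotting code issues - the station id, channel code, time range, and trimmed column list are just placeholders:

    -- Typical plotting access pattern: full-resolution values for one
    -- station/channel over an arbitrary time window (values shown are placeholders)
    SELECT datetime, rsam, freq_max10, ssa_max10
    FROM data
    WHERE station = 27
      AND channel = 'BHZ'
      AND datetime BETWEEN '2021-06-01 00:00:00+00' AND '2021-09-01 00:00:00+00'
    ORDER BY datetime;

This is the sort of access I had in mind when creating the (station, datetime) and (datetime, station, channel) indexes listed in the table definition below.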
Some additional data that may or may not be relevant:

- The workload is largely insert-intensive. Every 10 minutes, 1Hz records are inserted for up to three channels from about 118 stations - at 600 records per channel per 10-minute chunk, that works out to up to around 212,000 new records inserted every 10 minutes. In practice the number is generally somewhat lower, as not all stations actually have three channels of data, but the majority do.

- The largest SELECT workflow currently is a script that pulls all available data for ONE channel of each station (currently - I suspect that will change to all channels in the near future) and runs some post-processing machine learning algorithms on it. This script (written in R, if that makes a difference) currently takes around half an hour to run, and is run once every four hours. I would estimate about 50% of the run time is data retrieval and the rest is the script doing its own thing. I am only responsible for integrating this script with the database; what it does with the data (and therefore how long that takes, as well as what data is needed) is up to my colleague. I have the script running on the same machine as the DB to minimize data transfer times.

- Other than the above processing script, the workload is fairly light, with only one or two users occasionally pulling up graphs of the data for a handful of channels on a handful of stations at most (single station/channel at a time). The time range of data needed for these graphs may vary from a few seconds (looking at a specific event, which may or may not be recent) to several years (looking at historical trends). As such, full-resolution data needs to be available quickly - that is, on a user-friendly time scale - for any period of time for which we have data.

- Some stats from pg_stat_user_tables:

volcano_seismology=# SELECT relname, last_vacuum, last_analyze, last_autovacuum, last_autoanalyze,
                            autovacuum_count, autoanalyze_count, n_dead_tup, idx_scan, idx_tup_fetch,
                            n_tup_ins, n_mod_since_analyze, n_ins_since_vacuum
                     FROM pg_stat_user_tables WHERE relname='data';
-[ RECORD 1 ]-------+------------------------------
relname             | data
last_vacuum         | 2021-09-29 17:33:19.269922+00
last_analyze        | 2021-09-29 17:33:32.281416+00
last_autovacuum     | 2021-10-04 12:28:38.250069+00
last_autoanalyze    | 2021-10-04 15:05:29.745062+00
autovacuum_count    | 30
autoanalyze_count   | 37
n_dead_tup          | 122031
idx_scan            | 1584854
idx_tup_fetch       | 44873856136
n_tup_ins           | 245597916
n_mod_since_analyze | 1901231
n_ins_since_vacuum  | 5958840

- and the data table definition:

     Column     |           Type           | Collation | Nullable |             Default
----------------+--------------------------+-----------+----------+----------------------------------
 id             | bigint                   |           | not null | nextval('data_id_seq'::regclass)
 datetime       | timestamp with time zone |           | not null |
 freq_max1      | double precision         |           |          |
 freq_max5      | double precision         |           |          |
 freq_max10     | double precision         |           |          |
 freq_max20     | double precision         |           |          |
 freq_max30     | double precision         |           |          |
 freq_max40     | double precision         |           |          |
 freq_max50     | double precision         |           |          |
 freq_max100    | double precision         |           |          |
 sd_freq_max5   | double precision         |           |          |
 sd_freq_max10  | double precision         |           |          |
 sd_freq_max20  | double precision         |           |          |
 sd_freq_max30  | double precision         |           |          |
 sd_freq_max40  | double precision         |           |          |
 sd_freq_max50  | double precision         |           |          |
 sd_freq_max100 | double precision         |           |          |
 ssa_max1       | double precision         |           |          |
 ssa_max5       | double precision         |           |          |
 ssa_max10      | double precision         |           |          |
 ssa_max20      | double precision         |           |          |
 ssa_max30      | double precision         |           |          |
 ssa_max40      | double precision         |           |          |
 ssa_max50      | double precision         |           |          |
 ssa_max100     | double precision         |           |          |
 sd_ssa_max5    | double precision         |           |          |
 sd_ssa_max10   | double precision         |           |          |
 sd_ssa_max20   | double precision         |           |          |
 sd_ssa_max30   | double precision         |           |          |
 sd_ssa_max40   | double precision         |           |          |
 sd_ssa_max50   | double precision         |           |          |
 sd_ssa_max100  | double precision         |           |          |
 station        | smallint                 |           | not null |
 channel        | character varying(6)     |           | not null |
 epoch          | integer                  |           |          |
 rsam           | double precision         |           |          |
 sd_rsam        | double precision         |           |          |
Indexes:
    "data_pkey" PRIMARY KEY, btree (id)
    "date_station_channel_idx" UNIQUE, btree (datetime, station, channel)
    "station_channel_epoch_idx" UNIQUE, btree (station, channel, epoch)
    "data_station_channel_idx" btree (station, channel)
    "station_data_idx" btree (station)
    "station_date_idx" btree (station, datetime)
Foreign-key constraints:
    "data_station_fkey" FOREIGN KEY (station) REFERENCES stations(id)
Triggers:
    update_epoch BEFORE INSERT OR UPDATE OF datetime ON data FOR EACH ROW EXECUTE FUNCTION store_epoch()

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172    cell: 907-328-9145