Guessing the “sd” is “standard deviation”? Any chance
those stddevs are easily calculable from base data?
Could cut your table size in half (and put those 20
cores to work on the reporting).
Possible - I’d have to dig into that with the script
author. I was just handed an R script (I don’t work with R…)
and told here’s the data it needs, here’s the output we need
stored in the DB. I then spent just enough time with the
script to figure out how to hook up the I/O. The schema is
pretty much just a raw dump of the output - I haven’t really
spent any resources figuring out what, exactly, the data is.
Maybe I should :-)
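If the sd column does turn out to be a plain standard deviation over raw samples that are (or could be) in the DB, I assume it could be computed at query time instead of stored - something like this, with entirely made-up table and column names:

-- purely hypothetical: assumes a raw_samples table holding the individual readings
select station, channel, date_trunc('hour', datetime) as bucket,
       avg(value)         as mean_value,
       stddev_samp(value) as sd
from raw_samples
group by station, channel, date_trunc('hour', datetime);

That's something I'd have to confirm with the script author before dropping the stored column, though.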
And I wonder if the last three indices are
strictly necessary? They take disc space too.
Not sure. Here’s the output from pg_stat_all_indexes:
volcano_seismology=# select * from pg_stat_all_indexes where relname='data';
 relid | indexrelid | schemaname | relname |       indexrelname        | idx_scan | idx_tup_read | idx_tup_fetch
-------+------------+------------+---------+---------------------------+----------+--------------+---------------
 19847 |      19869 | public     | data    | data_pkey                 |        0 |            0 |             0
 19847 |      19873 | public     | data    | date_station_channel_idx  |   811884 |  12031143199 |    1192412952
 19847 |      19875 | public     | data    | station_channel_epoch_idx |        8 |       318506 |        318044
 19847 |      19876 | public     | data    | station_data_idx          |     9072 |         9734 |          1235
 19847 |      19877 | public     | data    | station_date_idx          |   721616 |  10927533403 |   10908912092
 19847 |      20479 | public     | data    | data_station_channel_idx  |    47293 | 194422257262 |    6338753379
(6 rows)
so they *have* been used (although not the station_data_idx
so much), but this doesn’t tell me when each one was last
used, so some of those scans may be from queries I was
experimenting with to see what was fastest but am no longer
running. Maybe I should keep an eye on this for a while and
see which counts are still increasing.
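Something like this is what I have in mind for that - the snapshot table is just something I’d create for the purpose:

-- grab the current counters into a scratch table (name is made up)
create table index_usage_snapshot as
    select now() as snapped_at, indexrelname, idx_scan
    from pg_stat_all_indexes
    where relname = 'data';

-- check back in a few weeks: an index whose idx_scan hasn't moved is a drop candidate
select indexrelname,
       s.idx_scan as scans_at_snapshot,
       c.idx_scan as scans_now,
       c.idx_scan - s.idx_scan as scans_since
from pg_stat_all_indexes c
join index_usage_snapshot s using (indexrelname)
where c.relname = 'data';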
But my bet is you’re headed for partitioning
on datetime or perhaps station.
While datetime partitioning seems to be the most common,
I’m not clear on how that would help here, as the most
intensive queries need *all* the datetimes for a given
station, and even the smaller queries would be getting an
arbitrary time range potentially spanning several, if not all,
partitions. Partitioning on station, on the other hand, seems
to make sense - there are over 100 of those, and pretty much
any query will only deal with a single station at a time.
Perhaps, if more partitions would be better, partition by both
station and channel? The queries that need to be fastest will
only be looking at a single channel of a single station.
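For concreteness, here’s roughly the shape I’m picturing - the table and column definitions are stand-ins for my real schema, and I haven’t actually tried this yet:

-- sketch: list-partition by station, then by channel within each station
create table data_part (
    station  text             not null,
    channel  text             not null,
    datetime timestamptz      not null,
    value    double precision,
    sd       double precision
) partition by list (station);

-- one partition per station, itself partitioned by channel
create table data_part_stn01
    partition of data_part
    for values in ('STN01')
    partition by list (channel);

create table data_part_stn01_ch1
    partition of data_part_stn01
    for values in ('CH1');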
I’ll look into this a bit more, maybe try some
experimenting while I still have *relatively* little data. My
main hesitation here is that in the brief look I’ve given
partitioning so far, it looks to be a royal pain to get set
up. Any tips for making that easier?
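The one idea I’ve had so far for cutting down the tedium is generating the per-station DDL from the stations already loaded, roughly like this (untested):

-- untested sketch: one partition per station, driven by the existing data
do $$
declare
    s text;
begin
    for s in select distinct station from data loop
        execute format(
            'create table if not exists %I partition of data_part'
            || ' for values in (%L) partition by list (channel)',
            'data_part_' || lower(s), s);
    end loop;
    -- the per-channel leaf partitions would need a similar loop
end $$;

but I’d still have to copy the existing rows into the new structure, so any pointers are welcome.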