Guessing the “sd” is “standard deviation”? Any chance
those stddevs are easily calculable from base data?
Could cut your table size in half (and put those 20
cores to work on the reporting).
Possible - I’d have to dig into that with the script
author. I was just handed an R script (I don’t work with R…)
and told here’s the data it needs, here’s the output we need
stored in the DB. I then spent just enough time with the
script to figure out how to hook up the I/O. The schema is
pretty much just a raw dump of the output - I haven’t really
spent any resources figuring out what, exactly, the data is.
Maybe I should :-)
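If the sd column does turn out to be a plain standard deviation over raw samples that are (or could be) in the DB, I assume it could be computed at query time instead of stored - something like this, with entirely made-up table and column names:

-- purely hypothetical: assumes a raw_samples table holding the individual readings
select station, channel, date_trunc('hour', datetime) as bucket,
       avg(value)         as mean_value,
       stddev_samp(value) as sd
from raw_samples
group by station, channel, date_trunc('hour', datetime);

That's something I'd have to confirm with the script author before dropping the stored column, though.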
And I wonder if the last three indices are
strictly necessary? They take disc space too.
Not sure. Here’s the output from pg_stat_all_indexes:
volcano_seismology=# select * from pg_stat_all_indexes where relname='data';
 relid | indexrelid | schemaname | relname |       indexrelname        | idx_scan | idx_tup_read | idx_tup_fetch
-------+------------+------------+---------+---------------------------+----------+--------------+---------------
 19847 |      19869 | public     | data    | data_pkey                 |        0 |            0 |             0
 19847 |      19873 | public     | data    | date_station_channel_idx  |   811884 |  12031143199 |    1192412952
 19847 |      19875 | public     | data    | station_channel_epoch_idx |        8 |       318506 |        318044
 19847 |      19876 | public     | data    | station_data_idx          |     9072 |         9734 |          1235
 19847 |      19877 | public     | data    | station_date_idx          |   721616 |  10927533403 |   10908912092
 19847 |      20479 | public     | data    | data_station_channel_idx  |    47293 | 194422257262 |    6338753379
(6 rows)
so they *have* been used (although not the station_data_idx
so much), but this doesn’t tell me when each one was last
used, so some of those scans may be from queries I was
experimenting with to see what was fastest but am no longer
running. Maybe I should keep an eye on this for a while and
see which counts are still increasing.
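Something like this is what I have in mind for that - the snapshot table is just something I’d create for the purpose:

-- grab the current counters into a scratch table (name is made up)
create table index_usage_snapshot as
    select now() as snapped_at, indexrelname, idx_scan
    from pg_stat_all_indexes
    where relname = 'data';

-- check back in a few weeks: an index whose idx_scan hasn't moved is a drop candidate
select indexrelname,
       s.idx_scan as scans_at_snapshot,
       c.idx_scan as scans_now,
       c.idx_scan - s.idx_scan as scans_since
from pg_stat_all_indexes c
join index_usage_snapshot s using (indexrelname)
where c.relname = 'data';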
But my bet is you’re headed for partitioning
on datetime or perhaps station.
While datetime partitioning seems to be the most common,
I’m not clear on how that would help here, as the most
intensive queries need *all* the datetimes for a given
station, and even the smaller queries would be getting an
arbitrary time range potentially spanning several, if not all,
partitions. Partitioning on station, on the other hand, seems
to make sense - there are over 100 of those, and pretty much
any query will only deal with a single station at a time.
Perhaps, if more partitions would be better, partition by both
station and channel? The queries that need to be fastest will
only be looking at a single channel of a single station.
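For concreteness, here’s roughly the shape I’m picturing - the table and column definitions are stand-ins for my real schema, and I haven’t actually tried this yet:

-- sketch: list-partition by station, then by channel within each station
create table data_part (
    station  text             not null,
    channel  text             not null,
    datetime timestamptz      not null,
    value    double precision,
    sd       double precision
) partition by list (station);

-- one partition per station, itself partitioned by channel
create table data_part_stn01
    partition of data_part
    for values in ('STN01')
    partition by list (channel);

create table data_part_stn01_ch1
    partition of data_part_stn01
    for values in ('CH1');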
I’ll look into this a bit more, maybe try some
experimenting while I still have *relatively* little data. My
main hesitation here is that in the brief look I’ve given
partitioning so far, it looks to be a royal pain to get set
up. Any tips for making that easier?
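The one idea I’ve had so far for cutting down the tedium is generating the per-station DDL from the stations already loaded, roughly like this (untested):

-- untested sketch: one partition per station, driven by the existing data
do $$
declare
    s text;
begin
    for s in select distinct station from data loop
        execute format(
            'create table if not exists %I partition of data_part'
            || ' for values in (%L) partition by list (channel)',
            'data_part_' || lower(s), s);
    end loop;
    -- the per-channel leaf partitions would need a similar loop
end $$;

but I’d still have to copy the existing rows into the new structure, so any pointers are welcome.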