Just a quick update to this topic from my testing: I whipped up a quick Python script to create the partition tables for me, which went smoothly enough, and created a table LIST-partitioned on station (the script boiled down to roughly the first sketch below). Once populated with my current data, this table proved marginally faster than the unpartitioned table, especially on the initial select (~4 seconds vs ~6 seconds, speeding up to around 2.4 seconds on subsequent queries). It is entirely possible that the partitioned table’s advantage will grow as the row count grows.

Then I tried partitioning by station, with the station tables sub-partitioned by channel (the second sketch below), on the logic that most queries, especially the ones that need to be fast, are only interested in a single channel on a single station. This made a HUGE improvement, relatively speaking: initial query time dropped to ~2.5 seconds, with subsequent queries coming in at closer to 1 second!

I’ll have to think about the maintenance side: what happens if a new station/channel comes online that I don’t have a partition for? My current thinking is to catch the error in my Python code when the INSERT of such a record fails, create the relevant table(s), then retry the INSERT (the third sketch below), but I’ll have to investigate more to figure out whether this is an option (what sort of error do I get?), or whether there is a better approach.

I guess time will tell whether this is a sustainable/good schema, but at least for the 1,171,575,191 rows I currently have, it gives much better performance than the non-partitioned table, and presumably will continue to do so into the future. Of course, if anyone else has suggestions other than simple partitioning, I’m all ears! Like I said, this is far outside my experience in terms of sheer data size (I will be talking to the TimescaleDB people tomorrow)!

Thanks again!
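For reference, the partition-creation script amounted to roughly the following. This is a minimal sketch: the sensor_data table, its columns, the station codes, and the connection string are all placeholders rather than my actual schema.

import psycopg2

STATIONS = ["AKSS", "AKGG", "AKRB"]  # placeholder station codes

conn = psycopg2.connect("dbname=seismic")  # placeholder DSN

with conn, conn.cursor() as cur:
    # Parent table, LIST-partitioned on station
    cur.execute("""
        CREATE TABLE sensor_data (
            station  text        NOT NULL,
            channel  text        NOT NULL,
            datetime timestamptz NOT NULL,
            value    double precision
        ) PARTITION BY LIST (station)
    """)
    # One partition per station. Identifiers can't be passed as query
    # parameters, so the station code is interpolated into the table
    # name; it comes from a known list, not user input.
    for station in STATIONS:
        cur.execute(
            f"CREATE TABLE sensor_data_{station.lower()} "
            f"PARTITION OF sensor_data FOR VALUES IN (%s)",
            (station,),
        )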
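The two-level version replaces the per-station leaf tables above: each station partition is itself declared PARTITION BY LIST (channel), so a query on a single station and channel only ever touches one leaf table. Same caveats on names.

CHANNELS = ["BHZ", "BHN", "BHE"]  # placeholder channel codes

with conn, conn.cursor() as cur:
    for station in STATIONS:
        # Station partition, itself partitioned on channel
        cur.execute(
            f"CREATE TABLE sensor_data_{station.lower()} "
            f"PARTITION OF sensor_data FOR VALUES IN (%s) "
            f"PARTITION BY LIST (channel)",
            (station,),
        )
        # One leaf table per channel under each station
        for channel in CHANNELS:
            cur.execute(
                f"CREATE TABLE sensor_data_{station.lower()}_{channel.lower()} "
                f"PARTITION OF sensor_data_{station.lower()} "
                f"FOR VALUES IN (%s)",
                (channel,),
            )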
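And the create-on-demand idea I mentioned would look something like this. It assumes the missing-partition failure surfaces as a CHECK violation (SQLSTATE 23514, psycopg2.errors.CheckViolation in psycopg2 2.8+), which is exactly the part I still need to verify, and create_partitions is a hypothetical helper wrapping DDL like the sketches above.

from psycopg2 import errors

def insert_reading(conn, station, channel, dt, value):
    """Insert one row; if its partition doesn't exist, create it and retry."""
    sql = ("INSERT INTO sensor_data (station, channel, datetime, value) "
           "VALUES (%s, %s, %s, %s)")
    row = (station, channel, dt, value)
    try:
        with conn.cursor() as cur:
            cur.execute(sql, row)
        conn.commit()
    except errors.CheckViolation:
        # Assumed error class for "no partition of relation ... found
        # for row"; needs verification against a real failed INSERT.
        conn.rollback()
        create_partitions(conn, station, channel)  # hypothetical helper
        with conn.cursor() as cur:
            cur.execute(sql, row)
        conn.commit()

If two loaders ever race to create the same partition, using CREATE TABLE IF NOT EXISTS ... PARTITION OF inside the helper should keep the second attempt harmless.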
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145