Just a quick update to this topic from my testing: I whipped up a quick Python script to create the partition tables for me, which went smoothly enough, and created a table LIST-partitioned on station (the script boiled down to roughly the first sketch below). Once populated with my current data, this table proved marginally faster than the unpartitioned table, especially on the initial select (~4 seconds vs ~6 seconds, speeding up to around 2.4 seconds on subsequent queries). It is entirely possible that the partitioned table’s advantage will grow as the row count grows.

Then I tried partitioning by station, with the station tables sub-partitioned by channel (the second sketch below), on the logic that most queries, especially the ones that need to be fast, are only interested in a single channel on a single station. This made a HUGE improvement, relatively speaking: initial query time dropped to ~2.5 seconds, with subsequent queries coming in at closer to 1 second!

I’ll have to think about the maintenance side: what happens if a new station/channel comes online that I don’t have a partition for? My current thinking is to catch the error in my Python code when the INSERT of such a record fails, create the relevant table(s), then retry the INSERT (the third sketch below), but I’ll have to investigate more to figure out whether this is an option (what sort of error do I get?), or whether there is a better approach.

I guess time will tell whether this is a sustainable/good schema, but at least for the 1,171,575,191 rows I currently have, it gives much better performance than the non-partitioned table, and presumably will continue to do so into the future. Of course, if anyone else has suggestions other than simple partitioning, I’m all ears! Like I said, this is far outside my experience in terms of sheer data size (I will be talking to the TimescaleDB people tomorrow)!

Thanks again!
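For reference, the partition-creation script amounted to roughly the following. This is a minimal sketch: the sensor_data table, its columns, the station codes, and the connection string are all placeholders rather than my actual schema.

import psycopg2

STATIONS = ["AKSS", "AKGG", "AKRB"]  # placeholder station codes

conn = psycopg2.connect("dbname=seismic")  # placeholder DSN

with conn, conn.cursor() as cur:
    # Parent table, LIST-partitioned on station
    cur.execute("""
        CREATE TABLE sensor_data (
            station  text        NOT NULL,
            channel  text        NOT NULL,
            datetime timestamptz NOT NULL,
            value    double precision
        ) PARTITION BY LIST (station)
    """)
    # One partition per station. Identifiers can't be passed as query
    # parameters, so the station code is interpolated into the table
    # name; it comes from a known list, not user input.
    for station in STATIONS:
        cur.execute(
            f"CREATE TABLE sensor_data_{station.lower()} "
            f"PARTITION OF sensor_data FOR VALUES IN (%s)",
            (station,),
        )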
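The two-level version replaces the per-station leaf tables above: each station partition is itself declared PARTITION BY LIST (channel), so a query on a single station and channel only ever touches one leaf table. Same caveats on names.

CHANNELS = ["BHZ", "BHN", "BHE"]  # placeholder channel codes

with conn, conn.cursor() as cur:
    for station in STATIONS:
        # Station partition, itself partitioned on channel
        cur.execute(
            f"CREATE TABLE sensor_data_{station.lower()} "
            f"PARTITION OF sensor_data FOR VALUES IN (%s) "
            f"PARTITION BY LIST (channel)",
            (station,),
        )
        # One leaf table per channel under each station
        for channel in CHANNELS:
            cur.execute(
                f"CREATE TABLE sensor_data_{station.lower()}_{channel.lower()} "
                f"PARTITION OF sensor_data_{station.lower()} "
                f"FOR VALUES IN (%s)",
                (channel,),
            )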
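And the create-on-demand idea I mentioned would look something like this. It assumes the missing-partition failure surfaces as a CHECK violation (SQLSTATE 23514, psycopg2.errors.CheckViolation in psycopg2 2.8+), which is exactly the part I still need to verify, and create_partitions is a hypothetical helper wrapping DDL like the sketches above.

from psycopg2 import errors

def insert_reading(conn, station, channel, dt, value):
    """Insert one row; if its partition doesn't exist, create it and retry."""
    sql = ("INSERT INTO sensor_data (station, channel, datetime, value) "
           "VALUES (%s, %s, %s, %s)")
    row = (station, channel, dt, value)
    try:
        with conn.cursor() as cur:
            cur.execute(sql, row)
        conn.commit()
    except errors.CheckViolation:
        # Assumed error class for "no partition of relation ... found
        # for row"; needs verification against a real failed INSERT.
        conn.rollback()
        create_partitions(conn, station, channel)  # hypothetical helper
        with conn.cursor() as cur:
            cur.execute(sql, row)
        conn.commit()

If two loaders ever race to create the same partition, using CREATE TABLE IF NOT EXISTS ... PARTITION OF inside the helper should keep the second attempt harmless.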
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145