Re: Strategies for converting UTC data to local windows for arbitrary resolutions and timezones

Alban Hertroys <haramrae@xxxxxxxxx> · Wed, 4 Oct 2023 21:06:34 +0200

> On 4 Oct 2023, at 17:58, Lincoln Swaine-Moore <lswainemoore@xxxxxxxxx> wrote:
> 
> > SELECT
> >     sub.gs AS ts_in_utc
> >     ,sub.gs AT TIME ZONE 'America/New_York' AS ts_in_local_tz
> >     ,date_bin('1 days', sub.gs AT TIME ZONE 'America/New_York', 
> > '2023-01-01')
> > FROM (SELECT generate_series('2023-11-03 00:00:00Z'::timestamptz, 
> > '2023-11-07 00:00:00Z'::timestamptz, '5 minutes'::interval) AS gs) AS sub
> > WHERE
> >     sub.gs >= '2023-11-05 00:00:00 America/New_York'::timestamptz AND 
> > sub.gs < '2023-11-06 00:00:00 America/New_York'::timestamptz
> 
> I believe this query will be funky around DST borders, because `sub.gs AT TIME ZONE 'America/New_York'` will be localized in a way that erases the difference between hours with different offsets, which are genuinely different. For instance, I ran this and there are two rows within it that look like: 
> 
> ` 2023-11-05 05:00:00+00 | 2023-11-05 01:00:00 | 2023-11-05 00:00:00`
> and 
> ` 2023-11-05 06:00:00+00 | 2023-11-05 01:00:00 | 2023-11-05 00:00:00`
> 
> I think that the non-unique second column will pose an issue for the date binning at a resolution finer than 1 day.

What I do in such cases is to add an extra column with the UTC timestamp to serve as a linear scale to the local timestamps. That also helps with ordering buckets in reports and such during DST changes (especially the ones where an hour repeats).

Filtering in the queries occurs on the UTC scale, with the local timestamps calculated back to UTC, so that it doesn’t matter whether the local time has 23, 24, 25 or 24.5 or 23.5 or whatever number of hours on a date-range - it all maps back because UTC always has 24 hours.

Something that I also do is to create calendar tables and views for the buckets, with 2 timestamps per bucket: the start of the bucket and the start of the next bucket. That gives you a range to put actual timestamps between (not BETWEEN between, because that’s inclusive). You can store and index that, as opposed to generated results using generate_series - basically I materialise those.

For hours and quarter hours I found it to be fairly convenient to base a view on a join between a date calendar and an (quarter of an) hour per UTC day table, but materialising that with some indexes may perform better (at the cost of disk space). I do materialise that currently, but our database server doesn’t have a lot of memory so I’m often not hitting the cache and performance suffers a bit (infrastructure is about to change for the better though).

Regards,

Alban Hertroys
--
There is always an exception to always.