Re: How to properly query lots of rows based on timestamps?

Alban Hertroys <haramrae@xxxxxxxxx> · Sat, 29 Aug 2020 11:29:07 +0200

> On 29 Aug 2020, at 10:24, Thorsten Schöning <tschoening@xxxxxxxxxx> wrote:
> 
> Hi all,
> 
> I have a table containing around 95 million rows, pretty much only
> storing a timestamp and further IDs of related tables containing the
> actual data in the end.
> 
>> CREATE TABLE clt_rec
>> ( 
>>  id BIGSERIAL NOT NULL, 
>>  oms_rec     BIGINT NOT NULL, 
>>  captured_at TIMESTAMP(6) WITH TIME ZONE NOT NULL, 
>>  rssi        SMALLINT NOT NULL, 
>>  CONSTRAINT pk_clt_rec PRIMARY KEY (id), 
>>  CONSTRAINT fk_clt_rec_oms_rec FOREIGN KEY (oms_rec) REFERENCES "oms_rec" ("id"), 
>>  CONSTRAINT uk_clt_rec_oms_rec UNIQUE (oms_rec) 
>> );
> 
> In many use cases I need to search all of those rows based on their
> timestamp to find rows arbitrary in the past: Sometimes it's only 15
> minutes into the past, sometimes it's 2 years, sometimes it's finding
> the first day of each month over 15 months for some of those telegrams
> etc. In the end, I pretty often need to compare those timestamps and
> some queries simply take multiple seconds in the end, especially
> adding up if multiple, but slightly different queries need to be
> executed one after another. The following are two abstracts of
> Postgres' query plans:
> 
> Plan 1:
> 
>> ->  Nested Loop  (cost=1.14..343169.49 rows=43543 width=20) (actual time=0.313..113.974 rows=34266 loops=3)
>>    ->  Parallel Index Scan using idx_clt_rec_captured_at on clt_rec  (cost=0.57..3437.90 rows=43543 width=24) (actual time=0.153..20.192 rows=34266 loops=3)
>>        Index Cond: ((captured_at >= ('2020-08-01 00:00:00+02'::timestamp with time zone - '00:00:00'::interval)) AND (captured_at <= ('2020-08-01 00:00:00+02'::timestamp with time zone + '1 day'::interval)))
>>    ->  Index Scan using pk_oms_rec on oms_rec  (cost=0.57..7.80 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=102799)
>>        Index Cond: (id = clt_rec.oms_rec)

What happens here is that the planner looks up the lower and upper boundaries, everything in between those index nodes is a candidate record. Next, it loops over those to match the other condition of your query (id = clt_rec.oms_rec). You didn’t tell whether there’s an index on that column.

You’d probably see a performance improvement were you to create an index on (captured_at, id). If your Postgres version is somewhat recent, that could even lead to an Index Only Scan.

> Plan 2:
> 
>> ->  Nested Loop  (cost=1.14..836381.50 rows=111934 width=20) (actual time=0.379..911.697 rows=334465 loops=3)
>>    ->  Parallel Index Scan using idx_clt_rec_captured_at on clt_rec  (cost=0.57..8819.57 rows=111934 width=24) (actual time=0.193..154.176 rows=334465 loops=3)
>>        Index Cond: ((captured_at >= ('2020-08-28 10:21:06.968+02'::timestamp with time zone - '14 days'::interval)) AND (captured_at <= ('2020-08-28 10:21:06.968+02'::timestamp with time zone + '00:00:00'::interval)))
>>    ->  Index Scan using pk_oms_rec on oms_rec  (cost=0.57..7.39 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=1003394)
>>        Index Cond: (id = clt_rec.oms_rec)

And this situation is very much the same issue, apart from the larger number of candidate records.

> Postgres seems to properly use available indexes, parallel workers and
> stuff like that. But looking at the actual times and compared to all
> the other parts of the query, comparing those timestamps simply takes
> the most time.

It only needs to compare 2 timestamps.

> I've looked into this topic and found statements about that one
> shouldn't put too many rows into the index[1] and stuff like that or
> it will be ignored at all. But that doesn't seem to be the case for me
> according to the plan. OTOH, my index really simply is about the
> column containing the timestamp, no function reducing things to dates
> or stuff like that to reduce the number of rows.
> 
>> CREATE INDEX idx_clt_rec_captured_at ON clt_rec USING btree ( captured_at );

Try this:
CREATE INDEX idx_clt_rec_captured_at ON clt_rec USING btree ( captured_at, id );

Alban Hertroys
--
There is always an exception to always.