Thorsten Schöning <tschoening@xxxxxxxxxx> writes:
> I have a table containing around 95 million rows, pretty much only
> storing a timestamp and further IDs of related tables containing the
> actual data in the end.

>> CREATE TABLE clt_rec
>> (
>>   id BIGSERIAL NOT NULL,
>>   oms_rec BIGINT NOT NULL,
>>   captured_at TIMESTAMP(6) WITH TIME ZONE NOT NULL,
>>   rssi SMALLINT NOT NULL,
>>   CONSTRAINT pk_clt_rec PRIMARY KEY (id),
>>   CONSTRAINT fk_clt_rec_oms_rec FOREIGN KEY (oms_rec) REFERENCES "oms_rec" ("id"),
>>   CONSTRAINT uk_clt_rec_oms_rec UNIQUE (oms_rec)
>> );

TBH, this seems like a pretty awful data design.  If you included the
timestamp column in oms_rec, and had an index on it, then you would not
need a join at all.

> Postgres seems to properly use available indexes, parallel workers and
> stuff like that. But looking at the actual times and compared to all
> the other parts of the query, comparing those timestamps simply takes
> the most time.

Timestamp comparison reduces to comparison of int64's, so it's hard to
imagine that anything could be much cheaper.  The part of your plan
that is actually eating the most cycles is the repeated index probes
into oms_rec:

>>   ->  Index Scan using pk_oms_rec on oms_rec  (cost=0.57..7.39 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=1003394)

0.002 * 1003394 = 2006.788, which of course can't be right, since the
whole query took only 911 ms; there's just not much accuracy in this
per-loop measurement.  In any case, the 155 msec spent scanning clt_rec
would be entirely unnecessary if the timestamp existed in oms_rec.  We
can also bet that the index traversal costs would be quite a bit less:
in this query we are re-descending pk_oms_rec from the root 1003394
times, which is a lot of cycles that wouldn't be needed with a single
scan on a timestamp index.

			regards, tom lane
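
PS: roughly what I have in mind is sketched below.  This is completely
untested, the index name is made up, and the final SELECT is only meant
to illustrate the shape of a time-range query once the timestamp lives
in oms_rec; whether you then drop clt_rec entirely, or keep it for other
reasons, depends on the rest of your schema.

	ALTER TABLE oms_rec
	    ADD COLUMN captured_at TIMESTAMP(6) WITH TIME ZONE;
	-- (rssi could travel along the same way, if you need it there too)

	-- backfill from the existing clt_rec rows
	UPDATE oms_rec o
	   SET captured_at = c.captured_at
	  FROM clt_rec c
	 WHERE c.oms_rec = o.id;

	CREATE INDEX idx_oms_rec_captured_at ON oms_rec (captured_at);

	-- the time-range restriction then becomes a single index scan, no join
	SELECT id, captured_at
	  FROM oms_rec
	 WHERE captured_at >= $1 AND captured_at < $2;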