On 10/15/18 8:57 AM, Phil Endecott wrote:
> Dear Experts,
>
> I have a few tables with "raw" timestamps like this:
>
>  +-------------------------------+----------+
>  |             time              | pressure |
>  +-------------------------------+----------+
>  | 2018-09-14 00:00:07.148378+00 |  1007.52 |
>  | 2018-09-14 00:10:07.147506+00 |  1007.43 |
>  | 2018-09-14 00:20:07.147533+00 |  1007.28 |
>  +-------------------------------+----------+
>
> For each of these tables I have a view which rounds the timestamp to
> the nearest minute, and ensures there is only one row per minute:
>
>  SELECT date_trunc('minute'::text, tbl."time") AS "time",
>         max(tbl.pressure) AS pressure
>  FROM tbl
>  GROUP BY (date_trunc('minute'::text, tbl."time"))
>  ORDER BY (date_trunc('minute'::text, tbl."time"));
>
> I then join these tables on the rounded time:
>
>  SELECT COALESCE(rain."time", pressures."time", temperatures."time") AS "time",
>         rain.rain, pressures.pressure, temperatures.temperature
>  FROM rain
>  FULL JOIN pressures USING ("time")
>  FULL JOIN temperatures USING ("time");
>
>  +------------------------+------+----------+-------------+
>  |          time          | rain | pressure | temperature |
>  +------------------------+------+----------+-------------+
>  | 2018-09-14 00:00:00+00 |    0 |  1007.52 |      11.349 |
>  | 2018-09-14 00:10:00+00 |    0 |  1007.43 |     11.2317 |
>  | 2018-09-14 00:20:00+00 |    0 |  1007.28 |     11.2317 |
>  +------------------------+------+----------+-------------+
>
> The COALESCE for time and the full joins are needed because some
> columns may be missing for some minutes.
>
> Now I'd like to find the values for a particular short time period:
>
>  SELECT * FROM readings
>  WHERE "time" BETWEEN '2018-10-01T00:00:00' AND '2018-10-01T24:00:00'
Is readings a table or a view? If a view, is the SELECT COALESCE ... query above the view's query?
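If it is a view, I am guessing the definition looks something like the below. The SELECT is taken straight from your post; only the CREATE VIEW wrapper is my assumption:

 -- Assumed definition of readings; the join query is from the post above.
 CREATE VIEW readings AS
 SELECT COALESCE(rain."time", pressures."time", temperatures."time") AS "time",
        rain.rain, pressures.pressure, temperatures.temperature
 FROM rain
 FULL JOIN pressures USING ("time")
 FULL JOIN temperatures USING ("time");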
> This works, but it is inefficient; it seems to create all the rounded
> data, do the join on all of it, and then filter on the time period.
> Ideally it would filter the raw data, and then need to round and join
> far fewer rows.
>
> It would not be difficult for me to round the timestamps when
> inserting the data, and also ensure that there is only one row for
> each minute. But I've done some experiments and even if I remove all
> the rounding and replace the full joins with regular joins, it still
> does sequential scans on at least one of the tables:
>
>  Nested Loop  (cost=12.95..144.99 rows=135 width=20)
>    Join Filter: (x_rain."time" = x_pressures."time")
>    ->  Hash Join  (cost=12.67..97.83 rows=135 width=24)
>          Hash Cond: (x_temperatures."time" = x_rain."time")
>          ->  Seq Scan on x_temperatures  (cost=0.00..67.50 rows=4350 width=12)
>          ->  Hash  (cost=10.98..10.98 rows=135 width=12)
>                ->  Index Scan using x_rain_by_time on x_rain  (cost=0.28..10.98 rows=135 width=12)
>                      Index Cond: (("time" >= '2018-10-01 00:00:00+00'::timestamp with time zone) AND ("time" <= '2018-10-02 00:00:00+00'::timestamp with time zone))
>    ->  Index Scan using x_pressures_by_time on x_pressures  (cost=0.28..0.34 rows=1 width=12)
>          Index Cond: ("time" = x_temperatures."time")
>
> Maybe that is because the tables are currently relatively small (a few
> thousand rows) and it believes that sequential scans are faster. (I
> have sometimes wished for an "explain" variant that tells me what
> query plan it would choose if all the tables were 100X larger.)
>
> Is there anything I can do to make this more efficient when the
> tables are larger?
>
> Thanks for any suggestions.
>
> Regards, Phil.
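One thing you could try is pushing the time filter down into each per-table aggregate yourself, so only the raw rows inside the window get truncated, grouped, and joined. A rough, untested sketch; rain_raw and pressure_raw are stand-ins for whatever your underlying tables are called, and the date range is just an example:

 -- Filter the raw rows first, then truncate and aggregate, then join.
 SELECT COALESCE(r."time", p."time") AS "time", r.rain, p.pressure
 FROM (
     SELECT date_trunc('minute', "time") AS "time", max(rain) AS rain
     FROM rain_raw                                  -- hypothetical raw table name
     WHERE "time" >= '2018-10-01' AND "time" < '2018-10-02'
     GROUP BY 1
 ) AS r
 FULL JOIN (
     SELECT date_trunc('minute', "time") AS "time", max(pressure) AS pressure
     FROM pressure_raw                              -- hypothetical raw table name
     WHERE "time" >= '2018-10-01' AND "time" < '2018-10-02'
     GROUP BY 1
 ) AS p USING ("time");

With an index on each raw "time" column the subqueries can use index scans, and nothing ever has to build the rounded data for the whole table. The same pattern extends to temperatures with a third subquery and another COALESCE argument.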
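On the "what if the tables were 100X larger" wish: there is no EXPLAIN mode for that, but you can get close by discouraging sequential scans for a single transaction and seeing what plan falls out, e.g.:

 BEGIN;
 -- Make seq scans look prohibitively expensive to the planner,
 -- for this transaction only.
 SET LOCAL enable_seqscan = off;
 EXPLAIN ANALYZE
 SELECT * FROM readings
 WHERE "time" BETWEEN '2018-10-01' AND '2018-10-02';
 ROLLBACK;

If the plan switches to index scans, the planner is simply judging seq scans cheaper at the current table sizes, and it should switch on its own as the tables grow, assuming the statistics are kept current with ANALYZE. If it still seq scans with the setting off, the index genuinely cannot be used for that query.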
-- 
Adrian Klaver
adrian.klaver@xxxxxxxxxxx