Re: Strange behavior of function date_trunc

Pavel Luzanov <p.luzanov@xxxxxxxxxxxxxx> · Thu, 6 May 2021 16:12:15 +0300

Hello,

On 05.05.2021 16:55, Tomas Vondra wrote:
Well, it'd not like date_trunc is executed for each row while now() is 
executed only once. The functions are executed for each row in both 
cases...
A couple more experiments.
Since I can't to track usage of system functions, I decided to play with 
user defined stable function.

In this case it is a wrapper for date_trunc:
CREATE OR REPLACE FUNCTION user_date_trunc(field text, source 
timestamptz) RETURNS timestamptz
AS $$
BEGIN
    RETURN date_trunc(field, source);
END;
$$ LANGUAGE plpgsql STABLE;

SET track_functions = 'all';
SELECT pg_stat_reset_single_function_counters('user_date_trunc'::regproc);

Let's create a test table:
CREATE TABLE t AS
    SELECT g.x, g.x::text AS y
    FROM generate_series('2021-01-01', '2021-06-01', '1 s'::interval) 
AS g(x) ORDER BY random();
VACUUM ANALYZE t;
SET random_page_cost = 1.1; -- for ssd disks

Previous query with user defined function:
EXPLAIN (ANALYZE, SETTINGS)
SELECT * FROM t
WHERE t.x >= user_date_trunc('year', '2021-05-06'::timestamptz);
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..3520631.00 rows=13045175 width=31) (actual 
time=63.977..7157.852 rows=13046401 loops=1)
   Filter: (x >= user_date_trunc('year'::text, '2021-05-06 
00:00:00+03'::timestamp with time zone))
 Settings: random_page_cost = '1.1'
 Planning Time: 0.373 ms
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, 
Deforming true
   Timing: Generation 0.404 ms, Inlining 53.122 ms, Optimization 7.100 
ms, Emission 3.600 ms, Total 64.227 ms
 Execution Time: 7502.564 ms

SELECT funcname, calls FROM pg_stat_user_functions WHERE funcid = 
'user_date_trunc'::regproc;
    funcname     |  calls
-----------------+----------
 user_date_trunc | 13046402

The function is executed once for each row in the table +1. As you 
explained earlier.
Ok.

But when there is an index on the column, the situation changes:
CREATE INDEX ON t(x);
ANALYZE t;
SELECT pg_stat_reset_single_function_counters('user_date_trunc'::regproc);

EXPLAIN (ANALYZE, SETTINGS)
SELECT * FROM t
WHERE t.x >= user_date_trunc('year', '2021-05-06'::timestamptz);
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using t_x_idx on t  (cost=0.69..373188.48 rows=13046480 
width=31) (actual time=3.163..26476.012 rows=13046401 loops=1)
   Index Cond: (x >= user_date_trunc('year'::text, '2021-05-06 
00:00:00+03'::timestamp with time zone))
 Settings: random_page_cost = '1.1'
 Planning Time: 0.294 ms
 JIT:
   Functions: 3
   Options: Inlining false, Optimization false, Expressions true, 
Deforming true
   Timing: Generation 0.551 ms, Inlining 0.000 ms, Optimization 0.380 
ms, Emission 2.587 ms, Total 3.518 ms
 Execution Time: 26787.377 ms

SELECT funcname, calls FROM pg_stat_user_functions WHERE funcid = 
'user_date_trunc'::regproc;
    funcname     | calls
-----------------+-------
 user_date_trunc |     2

Only two times!
Does having an index allow the function value to be cached?

By the way, an index is used to access the table, although seq scan are 
much better.

Now with scalar subquery:
EXPLAIN (ANALYZE, SETTINGS)
SELECT * FROM t
WHERE t.x >= (SELECT user_date_trunc('year', '2021-05-06'::timestamptz));
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using t_x_idx on t  (cost=0.70..194745.67 rows=4348827 
width=31) (actual time=3.000..26543.395 rows=13046401 loops=1)
   Index Cond: (x >= $0)
   InitPlan 1 (returns $0)
     ->  Result  (cost=0.00..0.26 rows=1 width=8) (actual 
time=0.030..0.030 rows=1 loops=1)
 Settings: random_page_cost = '1.1'
 Planning Time: 0.071 ms
 JIT:
   Functions: 4
   Options: Inlining false, Optimization false, Expressions true, 
Deforming true
   Timing: Generation 0.671 ms, Inlining 0.000 ms, Optimization 0.188 
ms, Emission 2.583 ms, Total 3.441 ms
 Execution Time: 26875.105 ms

SELECT funcname, calls FROM pg_stat_user_functions WHERE funcid = 
'user_date_trunc'::regproc;
    funcname     | calls
-----------------+-------
 user_date_trunc |     3

The function is executed once, but the index is still in use.

The second question. What is the reason for choosing an index scan?

Seq scan with constant value only:
EXPLAIN (ANALYZE, SETTINGS)
SELECT * FROM t
WHERE t.x >= '2021-01-01'::timestamptz;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..259011.00 rows=13046480 width=31) (actual 
time=2.135..913.628 rows=13046401 loops=1)
   Filter: (x >= '2021-01-01 00:00:00+03'::timestamp with time zone)
 Settings: random_page_cost = '1.1'
 Planning Time: 0.140 ms
 JIT:
   Functions: 2
   Options: Inlining false, Optimization false, Expressions true, 
Deforming true
   Timing: Generation 0.392 ms, Inlining 0.000 ms, Optimization 0.163 
ms, Emission 1.837 ms, Total 2.391 ms
 Execution Time: 1195.985 ms

The statistics on t.x shows that the condition in the query is not 
selective and seq scan are preferred over index scan.
SELECT n_distinct, (histogram_bounds::text::text[])[1]
FROM pg_stats WHERE tablename = 't' AND attname = 'x';
 n_distinct |    histogram_bounds
------------+------------------------
         -1 | 2021-01-01 00:01:10+03

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company