Hi folks,
I am working on optimizing a query that paginates efficiently through a large table using multi-column "cursors", a.k.a. the "seek method" (as described in detail here: https://use-the-index-luke.com/sql/partial-results/fetch-next-page).
The table (drastically simplified) looks like this:
CREATE TABLE data
(
col_1 int NOT NULL,
col_2 int NOT NULL,
col_3 int NOT NULL,
content varchar(10) NOT NULL
);
And has an appropriate index:
CREATE INDEX data_index ON data (col_1, col_2, col_3);
The recommended query to paginate through this table is using the "row values" syntax:
SELECT content
FROM data
WHERE (col_1, col_2, col_3) > (10, 20, 29)
ORDER BY col_1, col_2, col_3
LIMIT 100;
Which results in a perfectly optimized query plan (slightly edited for readability, using test data of 1 million rows, on PostgreSQL 17.2):
Limit (cost=0.42..5.33 rows=100 width=20)
(actual time=0.084..0.197 rows=100 loops=1)
-> Index Scan using data_index on data
(cost=0.42..43617.30 rows=889600 width=20)
(actual time=0.081..0.176 rows=100 loops=1)
Index Cond: (ROW(col_1, col_2, col_3) > ROW(10, 20, 29))
Planning Time: 0.344 ms
Execution Time: 0.264 ms
However, in reality, my query uses a mix of ascending and descending ordering (with an index matching the order columns). In that case the WHERE (col_1, col_2, col_3) > (10, 20, 29) syntax is not an option, because a row-value comparison compares all columns in the same direction (unless I somehow derive "reversed" data from the columns, which I would like to avoid).
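To illustrate the mixed-order case (a hypothetical variant of the simplified table, not my actual schema): with an index such as

CREATE INDEX data_index_mixed ON data (col_1 ASC, col_2 DESC, col_3 ASC);

the seek condition for continuing after the row (10, 20, 29) has to flip the comparison direction per column, e.g.:

SELECT content
FROM data
WHERE col_1 > 10
   OR (col_1 = 10 AND col_2 < 20)
   OR (col_1 = 10 AND col_2 = 20 AND col_3 > 29)
ORDER BY col_1 ASC, col_2 DESC, col_3 ASC
LIMIT 100;

which has no single row-value equivalent.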
Therefore I am using an equivalent query built from multiple WHERE conditions, something like this (for simplicity, no mixed ordering is involved in the examples):
SELECT content
FROM data
WHERE
col_1 >= 10
AND (
col_1 > 10
OR (
col_2 >= 20
AND (
col_2 > 20
OR col_3 > 29
)
)
)
ORDER BY col_1, col_2, col_3
LIMIT 100;
Which returns the same rows, but the query plan is slightly less efficient:
Limit (cost=0.42..6.48 rows=100 width=20)
(actual time=0.848..0.893 rows=100 loops=1)
-> Index Scan using data_index on data
(cost=0.42..52984.52 rows=874874 width=20)
(actual time=0.847..0.884 rows=100 loops=1)
Index Cond: (col_1 >= 10)
Filter: ((col_1 > 10) OR ((col_2 >= 20) AND ((col_2 > 20) OR (col_3 > 29))))
Rows Removed by Filter: 2030
Planning Time: 0.133 ms
Execution Time: 0.916 ms
Instead of relying exclusively on an index access predicate, this plan also involves an index filter predicate: 2030 rows are read from the index and then discarded by the filter.
Without being familiar with the internals of the query planner, I *think* there *should* be a way to phrase the WHERE conditions so that they result in the "perfect" plan. There are several ways to phrase the conditions, of which I've tried a few, only to get the same or worse performance. Does anyone have a suggestion for a better query?
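One further phrasing I'm considering (a sketch only; I haven't verified whether its total cost actually beats the filter-predicate plan) is decomposing the cursor condition into disjoint branches combined with UNION ALL, so that each branch gets a pure access predicate:

SELECT content
FROM (
    (SELECT col_1, col_2, col_3, content
     FROM data
     WHERE col_1 = 10 AND col_2 = 20 AND col_3 > 29
     ORDER BY col_1, col_2, col_3
     LIMIT 100)
    UNION ALL
    (SELECT col_1, col_2, col_3, content
     FROM data
     WHERE col_1 = 10 AND col_2 > 20
     ORDER BY col_1, col_2, col_3
     LIMIT 100)
    UNION ALL
    (SELECT col_1, col_2, col_3, content
     FROM data
     WHERE col_1 > 10
     ORDER BY col_1, col_2, col_3
     LIMIT 100)
) AS branches
ORDER BY col_1, col_2, col_3
LIMIT 100;

In the mixed-order case, each branch would flip the comparison operator and sort direction for the descending columns.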
I am also aware that I might be chasing an optimization with low returns (yet to see how it performs on Real Life data), but I'm already too deep down the rabbit hole to turn back without knowing The Truth.
I've dumped my experiments into a DB Fiddle: https://www.db-fiddle.com/f/kd8zaibZGKGH1HSStyxNkx/0
Cheers,
Márton