Hi folks,
I am working on optimizing a query that paginates efficiently through a large table using multi-column "cursors", a.k.a. the "seek method" (as described in detail here: https://use-the-index-luke.com/sql/partial-results/fetch-next-page).
The table (drastically simplified) looks like this:
CREATE TABLE data
(
col_1 int NOT NULL,
col_2 int NOT NULL,
col_3 int NOT NULL,
content varchar(10) NOT NULL
);
And has an appropriate index:
CREATE INDEX data_index ON data (col_1, col_2, col_3);
The recommended query to paginate through this table is using the "row values" syntax:
SELECT content
FROM data
WHERE (col_1, col_2, col_3) > (10, 20, 29)
ORDER BY col_1, col_2, col_3
LIMIT 100;
Which results in a perfectly optimized query plan (slightly edited for readability, using test data of 1 million rows, on PostgreSQL 17.2):
Limit (cost=0.42..5.33 rows=100 width=20)
(actual time=0.084..0.197 rows=100 loops=1)
-> Index Scan using data_index on data
(cost=0.42..43617.30 rows=889600 width=20)
(actual time=0.081..0.176 rows=100 loops=1)
Index Cond: (ROW(col_1, col_2, col_3) > ROW(10, 20, 29))
Planning Time: 0.344 ms
Execution Time: 0.264 ms
However, in reality, my query uses a mix of ascending and descending ordering (with an index matching the order columns). In that case the WHERE (col_1, col_2, col_3) > (10, 20, 29) syntax is not an option, because a row-value comparison compares all columns in the same direction (unless I somehow derive "reversed" data from the columns, which I would like to avoid).
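To illustrate the mixed-order case (a hypothetical variant of the simplified table, not my actual schema): with an index such as

CREATE INDEX data_index_mixed ON data (col_1 ASC, col_2 DESC, col_3 ASC);

the seek condition for continuing after the row (10, 20, 29) has to flip the comparison direction per column, e.g.:

SELECT content
FROM data
WHERE col_1 > 10
   OR (col_1 = 10 AND col_2 < 20)
   OR (col_1 = 10 AND col_2 = 20 AND col_3 > 29)
ORDER BY col_1 ASC, col_2 DESC, col_3 ASC
LIMIT 100;

which has no single row-value equivalent.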
Therefore I am using an equivalent query built from multiple WHERE conditions, something like this (for simplicity, no mixed ordering is involved in the examples):
SELECT content
FROM data
WHERE
col_1 >= 10
AND (
col_1 > 10
OR (
col_2 >= 20
AND (
col_2 > 20
OR col_3 > 29
)
)
)
ORDER BY col_1, col_2, col_3
LIMIT 100;
Which returns the same rows, but the query plan is slightly less efficient:
Limit (cost=0.42..6.48 rows=100 width=20)
(actual time=0.848..0.893 rows=100 loops=1)
-> Index Scan using data_index on data
(cost=0.42..52984.52 rows=874874 width=20)
(actual time=0.847..0.884 rows=100 loops=1)
Index Cond: (col_1 >= 10)
Filter: ((col_1 > 10) OR ((col_2 >= 20) AND ((col_2 > 20) OR (col_3 > 29))))
Rows Removed by Filter: 2030
Planning Time: 0.133 ms
Execution Time: 0.916 ms
Instead of relying exclusively on an index access predicate, this plan also involves an index filter predicate: 2030 rows are read from the index and then discarded by the filter.
Without being familiar with the internals of the query planner, I *think* there *should* be a way to phrase the WHERE conditions so that they result in the "perfect" plan. There are several ways to phrase the conditions, of which I've tried a few, only to get the same or worse performance. Does anyone have a suggestion for a better query?
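One further phrasing I'm considering (a sketch only; I haven't verified whether its total cost actually beats the filter-predicate plan) is decomposing the cursor condition into disjoint branches combined with UNION ALL, so that each branch gets a pure access predicate:

SELECT content
FROM (
    (SELECT col_1, col_2, col_3, content
     FROM data
     WHERE col_1 = 10 AND col_2 = 20 AND col_3 > 29
     ORDER BY col_1, col_2, col_3
     LIMIT 100)
    UNION ALL
    (SELECT col_1, col_2, col_3, content
     FROM data
     WHERE col_1 = 10 AND col_2 > 20
     ORDER BY col_1, col_2, col_3
     LIMIT 100)
    UNION ALL
    (SELECT col_1, col_2, col_3, content
     FROM data
     WHERE col_1 > 10
     ORDER BY col_1, col_2, col_3
     LIMIT 100)
) AS branches
ORDER BY col_1, col_2, col_3
LIMIT 100;

In the mixed-order case, each branch would flip the comparison operator and sort direction for the descending columns.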
I am also aware that I might be chasing an optimization with low returns (yet to see how it performs on Real Life data), but I'm already too deep down the rabbit hole to turn back without knowing The Truth.
I've dumped my experiments into a DB Fiddle: https://www.db-fiddle.com/f/kd8zaibZGKGH1HSStyxNkx/0
Cheers,
Márton