On Mon, Aug 28, 2023 at 8:33 PM jayaprabhakar k <jayaprabhakar@xxxxxxxxx> wrote:
Hi,
TL;DR:
Observations:
- REINDEX requires a full table scan
- Roughly create a new index, rename index, drop old index.
- REINDEX is not incremental. running reindex frequently does not reduce the future reindex time.
- REINDEX does not use the index itself
- VACUUM does not clean up the indices. (relpages >> reltuples) I understand, vacuum is supposed to remove pages only if there are no live tuples in the page, but somehow, even immediately after vacuum, I see relpages significantly greater than reltuples. I would have assumed, relpages <= reltuples
- Query Planner does not consider index bloat, so uses highly bloated partial index that is terribly slow over other index
Your points 3 and 4 are not correct. empty index pages are put on a freelist for future reuse, they are not physically removed from the underlying index files. Maybe they are not actually getting put on the freelist or not being reused from the freelist for some reason, but that would be a different issue. Use the extension pgstattuple to see what its function pgstatindex says about the index.
The planner does take index bloat into consideration, but its effect size is low. Which it should be, as empty or irrelevant pages should be efficiently skipped during the course of most index operations. To figure out what is going with your queries, you should do an EXPLAIN (ANALYZE, BUFFERS) of them, but with it being slow and with it being fast.
Question: Is there a way to optimize postgres vacuum/reindex when using partial indexes?
Without knowing what is actually going wrong, I can only offer generalities. Make sure you don't have long-lived transactions which prevent efficient clean up. Increase the frequency on which vacuum runs on the table. It can't reduce the size of an already bloated index, but by keeping the freelist stocked it should be able prevent it from getting bloated in the first place. Also, it can remove empty pages from being linked into the index tree structure, which means they won't need to be scanned even though they are still in the file. It can also free up space inside non-empty pages for future reuse within that same page, and so that index tuples don't need to be chased down in the table only to be found to be not visible.
```
SELECT [columns list]FROM tasksWHERE status NOT IN (3,4,5) AND created > NOW() - INTERVAL '30 days' AND updated < NOW() - interval '30 minutes'
```
Since we are only interested in the pending tasks, I created a partial index
`"tasks_pending_status_created_type_idx" btree (status, created, task_type) WHERE status <> ALL (ARRAY[3, 4, 5])`.
This looks like a poorly designed index. Since the status condition exactly matches the index where clause, there is no residual point in having "status" be the first column in the index, it can only get in the way (for this particular query). Move it to the end, or remove it altogether.
Within the tuples which pass the status check, which inequality is more selective, the "created" one or "updated" one?
Cheers,
Jeff