Thanks Justin.
The 4ms different in the examples isn't an accurate benchmark. I'm seeing about a ~20% difference over a larger sample size. And this is on a fork of the production database.
Apart from the end-performance, I'm motivated to figure out why one index results in an index scan whereas the other one does not.
I didn't mention this in my original email but I've separately tested dropping the
`state` index, running VACUUM FULL on the table, then recreating both indexes. The result was the same where querying on state produced an index scan whereas closed produced a bitmap scan.
Andrew's email and Michael's follow-up has me curious because it suggests I'm running into a issue specific to indexing on IS NULL, @Justin what do you think of this?
In the meantime Justin I'll investigate some more of your suggestions.
On Tue, Feb 19, 2019 at 9:37 PM Justin Pryzby <pryzby@xxxxxxxxxxxxx> wrote:
On Tue, Feb 19, 2019 at 05:10:43PM -0700, Abi Noda wrote:
> I have a table as defined below. The table contains 1,027,616 rows, 50,349
> of which have state='open' and closed IS NULL. Since closed IS NULL for all
> rows where state='open', I want to remove the unnecessary state column.
>
> CREATE TABLE tickets (
> id bigserial primary key,
> state character varying,
> closed timestamp,
...
> );
>
> CREATE INDEX "state_index" ON "tickets" ("state") WHERE ((state)::text =
> 'open'::text));
>
> As part of the process of removing the state column, I am trying to index
> the closed column so I can achieve equal query performance (index scan) as
> when I query on the state column as shown below:
>
> EXPLAIN ANALYZE select title, created, closed, updated from tickets where state = 'open';
> Index Scan using state_index on tickets (cost=0.29..23430.20 rows=50349 width=64) (actual time=17.221..52.110 rows=51533 loops=1)
>
> However, when I index the closed column, a bitmap scan is used instead of
> an index scan, with slightly slower performance. Why isn't an index scan
> being used, given that the exact same number of rows are at play as in my
> query on the state column? How do I index closed in a way where an index
> scan is used?
>
> CREATE INDEX closed_index ON tickets (id) WHERE closed IS NULL;
> EXPLAIN ANALYZE select title, created, closed, updated from tickets where closed IS NULL;
> Bitmap Heap Scan on tickets (cost=824.62..33955.85 rows=50349 width=64) (actual time=10.420..56.095 rows=51537 loops=1)
> -> Bitmap Index Scan on closed_index (cost=0.00..812.03 rows=50349 width=0) (actual time=6.005..6.005 rows=51537 loops=1)
Are you really concerned about 4ms ? If this is a toy-sized test system,
please try on something resembling production, perhaps by loading production or
fake data, or perhaps on a production system within a transactions (begin; CREATE
INDEX CONCURRENTLY; explain ...; rollback).
You can see that most of the estimated cost is from the table (the index scan
accounts for only 812 of total 33955 cost units). So I'm guessing the planner
thinks that an index scan will either 1) access the table randomly; and/or, 2)
access a large fraction of the table.
If it was just built, the first (partial/conditional/predicate/where) index
will scan table in its "physical" order (if not sequentially).
The 2nd index is going to scan table in order of ID, which I'm guessing is not
"correlated" with its physical order, so an index scan cost is computed as
accessing a larger fraction of the table (but by using an "bitmap" scan it's at
least in physical order). In fact: 50349/17478 = ~3 tuples/page is low, so
you're accessing a large fraction of the table to return a small fraction of
its tuples.
You can check what it thinks here:
https://wiki.postgresql.org/wiki/Slow_Query_Questions#Statistics:_n_distinct.2C_MCV.2C_histogram
You could try CLUSTERing the table on ID (which requires a non-partial index)
and ANALYZEing (which might cause this and other queries to be planned and/or
perform differently). That causes the table to be locked exclusively. Then,
the planner knows that scanning index and returning results ordered by IDs
(which doesn't matter) will also access table in physical order (which
matters), and maybe fewer pages need to be read, too.
Justin