Hello, I have been having a problem with the following query ignoring an index on the foos.bar column:

SELECT c.id
FROM foos c, bars r
WHERE r.id != 0
  AND r.modified_time > '2006-10-20 10:00:00.000'
  AND r.modified_time <= '2006-10-30 15:20:00.000'
  AND c.bar = r.id

The bars table contains 597 rows, while the foos table contains 5031203 rows. After much research I figured out that the problem is caused by the PG planner deciding that my foos.bar index is not useful. The foos.bar column contains 5028698 rows with the value 0 and 2505 rows whose values are ids in the bars table. Both tables have just been analyzed.

When I EXPLAIN ANALYZE the above query, I get the following:

"Hash Join  (cost=3.06..201642.49 rows=25288 width=8) (actual time=0.234..40025.514 rows=11 loops=1)"
"  Hash Cond: ("outer".bar = "inner".id)"
"  ->  Seq Scan on foos c  (cost=0.00..176225.03 rows=5032303 width=16) (actual time=0.007..30838.623 rows=5031203 loops=1)"
"  ->  Hash  (cost=3.06..3.06 rows=3 width=8) (actual time=0.117..0.117 rows=20 loops=1)"
"        ->  Index Scan using bars_index_modified_time on bars r  (cost=0.00..3.06 rows=3 width=8) (actual time=0.016..0.066 rows=20 loops=1)"
"              Index Cond: ((modified_time > '2006-10-20 10:00:00'::timestamp without time zone) AND (modified_time <= '2006-10-30 15:20:00'::timestamp without time zone))"
"              Filter: (id <> 0)"
"Total runtime: 40025.629 ms"

The solution I found was to change the statistics target on my foos.bar column from the default -1 to 1000 (the exact commands I used are in the P.S. below). When I do this, reanalyze the table, and rerun the above query, I get the following expected result:

"Nested Loop  (cost=0.00..25194.66 rows=25282 width=8) (actual time=13.035..23.338 rows=11 loops=1)"
"  ->  Index Scan using bars_index_modified_time on bars r  (cost=0.00..3.06 rows=3 width=8) (actual time=0.063..0.115 rows=20 loops=1)"
"        Index Cond: ((modified_time > '2006-10-20 10:00:00'::timestamp without time zone) AND (modified_time <= '2006-10-30 15:20:00'::timestamp without time zone))"
"        Filter: (id <> 0)"
"  ->  Index Scan using foos_index_bar on foos c  (cost=0.00..6824.95 rows=125780 width=16) (actual time=1.141..1.152 rows=1 loops=20)"
"        Index Cond: (c.bar = "outer".id)"
"Total runtime: 23.446 ms"

Having to do this concerns me, as I am not sure what a good statistics value should be. We also expect this table to grow much larger, and I am concerned that the query may not continue to use the index as it does. I tried a value of 100, and that works while the number of bars records is small, but as soon as I increase them the query starts ignoring the index again.

Is increasing the statistics value the best way to resolve this problem? How can I best decide on a good statistics value?

Having a column containing a large number of NULL or 0 values seems fairly common. Is there a way to tell Postgres to create an index of only the values with meaning, i.e. all non-zero values? I could not find one.

Thanks in advance,
Leif
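
P.S. For reference, this is roughly what I ran to change the statistics target and reanalyze; I am typing it from memory, so treat it as a sketch rather than a verbatim transcript:

-- raise the per-column statistics target for foos.bar (was -1, i.e. use the default)
ALTER TABLE foos ALTER COLUMN bar SET STATISTICS 1000;
-- re-gather statistics so the planner sees the new target
ANALYZE foos;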