Re: Unexpected query plan results

Robert Haas <robertmhaas@xxxxxxxxx> · Mon, 1 Jun 2009 18:58:40 -0400

On Mon, Jun 1, 2009 at 4:53 PM, Anne Rosset <arosset@xxxxxxxxxx> wrote:
>> On Mon, Jun 1, 2009 at 2:14 PM, Anne Rosset <arosset@xxxxxxxxxx> wrote:
>>> SELECT SUM(1) FROM item WHERE is_deleted = 'f'; sum --------- 1824592 (1
>>> row)
>>> SELECT SUM(1) FROM item WHERE folder_id = 'tracker3641
>>> </sf/sfmain/do/go/tracker3641?returnUrlKey=1243878161701>'; sum --------
>>> 122412 (1 row)
>>> SELECT SUM(1) FROM item WHERE folder_id = 'tracker3641
>>> </sf/sfmain/do/go/tracker3641?returnUrlKey=1243878161701>' AND is_deleted
>>> =
>>> 'f'; sum ----- 71 (1 row)
>>> SELECT SUM(1) FROM item WHERE folder_id = 'tracker3641
>>> </sf/sfmain/do/go/tracker3641?returnUrlKey=1243878161701>' AND is_deleted
>>> =
>>> 't'; sum -------- 122341 (1 row)
>
> The item table has 2324829 rows

So 1824592/2324829 = 78.4% of the rows have is_deleted = false, and
0.06709% of the rows have the relevant folder_id.  Therefore the
planner assumes that there will be 2324829 * 78.4% * 0.06709% =~
96,000 rows that satisfy both criteria (the original explain had
97,000; there's some variability due to the fact that the analyze only
samples a random subset of pages), but the real number is 71, leading
it to make a very bad decision.  This is a classic "hidden
correlation" problem, where two columns are correlated but the planner
doesn't notice, and you get a terrible plan.

Unfortunately, I'm not aware of any real good solution to this
problem.  The two obvious approaches are multi-column statistics and
planner hints; PostgreSQL supports neither.  There are various
possible hacks that aren't very satisfying, such as:

1. Redesign the application to put the deleted records in a separate
table from the non-deleted records.  But if the deleted records still
have child records in other tables, this won't fly due to foreign key
problems.

2. Inserting a clause that the optimizer doesn't understand to fool it
into thinking that the scan on the item table is much more selective
than is exactly the case.  I think adding (item.id + 0) = (item.id +
0) to the WHERE clause will work; the planner will brilliantly
estimate the selectivity of that expression as one in 200.  The
problem with this is that it will likely lead to a better plan in this
particular case, but for other folder_ids it may make things worse.
There's also no guarantee that a future version of PostgreSQL won't be
smart enough to see through this type of sophistry, though I think
you're safe as far as the forthcoming 8.4 release is concerned.

3. A hack that makes me gag, but it actually seems to work...

CREATE OR REPLACE FUNCTION item_squash(varchar, boolean) RETURNS varchar[] AS $$
SELECT array[$1, CASE WHEN $2 THEN 'true' ELSE 'false' END]
$$ LANGUAGE sql IMMUTABLE;

CREATE INDEX item_squash_idx ON item (item_squash(folder_id, is_deleted));

...and then remove "folder_id = XXX AND is_deleted = YYY" from your
query and substitute "item_squash(folder_id, is_deleted) =
item_squash(XXX, YYY)".  The expresson index forces the planner to
gather statistics on the distribution of values for that expression,
and if you then write a query using that exact same expression the
planner can take advantage of it.

...Robert

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance