Re: planner chooses unoptimal plan on joins with complex key

"Dmitry Potapov" <fortune.fish@xxxxxxxxx> · Thu, 31 Jan 2008 14:40:11 +0300



2008/1/30, Tom Lane <tgl@xxxxxxxxxxxxx>:
> "Dmitry Potapov" <fortune.fish@xxxxxxxxx> writes:
> > 2008/1/25, Tom Lane <tgl@xxxxxxxxxxxxx>:
> >> It's hard to tell whether the planner is just being overoptimistic
> >> about the results of ANDing the three join conditions, or if one or
> >> more of the basic condition selectivities were misestimated.
>
> > I wonder if it will ever finish the work :( So, for now, the only
> > thing I can show is:
>
> > =# explain select 1 from t1, t2 where t1.m1=t2.m1 ;
> >                                    QUERY PLAN
> > --------------------------------------------------------------------------------
> >  Nested Loop  (cost=0.00..162742993234.25 rows=57462242912003 width=0)
> >    ->  Seq Scan on t2  (cost=0.00..690784.54 rows=30820054 width=4)
> >    ->  Index Scan using i_t1_m1 on t1  (cost=0.00..3080.74 rows=175973 width=4)
> >          Index Cond: (t1.m1 = t2.m1)
> > (4 rows)
>
> So this is estimating the join selectivity for m1 alone as
>
> regression=# select 57462242912003 / (30820054.0 * 21292688);
>         ?column?
> ------------------------
>  0.08756260792962293676
> (1 row)
>
> which would suggest that there are only 11 or so distinct values for m1;
> is that about right?
No, that's wrong. There's 155 distinct values in t1, and 157 in t2. By
the way, due to the nature of the data stored in m1 (see below), the
selectivity for m1, imho, should be bigger than 0.08756, something
about  0,80 or so. Unfortunally I can't be sure about that, the query
in question is still running, so the real rowcount is still not
available :(

> There's something a bit funny here --- looking at the plan components,
> the output row estimate should be about 30820054 * 175973, but it's ten
> times that.  I'm not sure if this is just a presentation artifact or
> if it points to a real bug.
That looks strange, indeed.

> Could you show us the pg_stats rows for t1.m1 and t2.m1?
I've attached them as text files to one of my previous letters. I'm
sure that email formatting will kill their readability. I attached
them once again to this letter, pgstats_t1_m,txt is for t1,
pgstats_t2_m.txt for t2.

To save your time, some fields:
t1.m1:
n_distinct=121
correlation=0.0910695

t2.m2:
n_distinct=111
correlation=0.148101

> The m2 estimate seems nearly dead on, and the m3 estimate is within 10%
> or so which is certainly close enough.  Barring the possibility that
> there's something wrong underneath that m1 estimate, the products of
> the selectivities do indeed work out to give us about 1 row out of the
> join with all three conditions.  So my conclusion is that these three
> values are not independent, but are highly correlated (in fact, it seems
> that m2 alone provides most of the join selectivity).  Do you actually
So, about the nature of the data in m1,m2,m3. m1 is something like
subsystem ID to which the row belongs, m3 is supposed to be unique
record id within that subsystem, but it starts with 0 when the system
is restarted. m2 is a timestamp in java format, it is there to
guarantee the uniqueness of (m1,m2,m3) triplet.
So, about corelation, m2 and m3 may seem to be correlated: if a system
inserts records at nearly constant rate, m2=a+b*m3..
> need to compare all three to make the join, or is one field perhaps
> dependent on the other two?  Can you refactor the data to make it so?
yes, I absolutely need to compare all three. about refactoring, the
only thing I can think out is to put m1,m2,m3 to a separate table,
with bigint primary key, and then to tie t1 and t2 with that key. but
that will require changes in the application and some amount of time
to convert the data :(


-- 
Regards,
            Dmitry

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend