Re: Wrong rows selected with view

Bill Moseley <moseley@xxxxxxxx> · Thu, 17 Nov 2005 06:06:53 -0800

On Wed, Nov 16, 2005 at 11:34:21PM -0500, Tom Lane wrote:
> No, I'm saying that the underlying data (the join result before applying
> DISTINCT ON) looks like this:
> 
> bill=# select  "class".id, person.id AS person_id
> bill-#    FROM "class", "location", region, person, instructors
> bill-#   WHERE "class"."location" = "location".id AND "class".id = instructors."class" 
> bill-# AND instructors.person = person.id AND "location".region = region.id
> bill-#   ORDER BY "class".id;     
>   id  | person_id 
> ------+-----------
>     1 |        49
>     2 |        27
>     3 |        19
>     4 |        82
>     5 |        12
>  ...
>  1238 |        61
>  1238 |        60
>  1239 |        40
>  1240 |        67
>  1241 |        11
>  1243 |        26
>  1243 |        84
>  1244 |        26
>  1244 |        84
> (1311 rows)
> 
> The DISTINCT ON will take just one of the two rows with id = 1243, and
> just one of the rows with id = 1244, and *it is effectively random which
> one gets picked*.  So when you then select rows with person_id = 84, you
> may or may not see these rows in the end result.

Yikes!  The problem is *when* DISTINCT ON happens, right?

And, Tom,  you actually explained this to me on the list back on Aug 25th,
but that's when I was using the view in a different way.  You noted
that the order was "unpredictable" but at that time it didn't matter
which row was selected to me.

    http://archives.postgresql.org/pgsql-general/2005-08/msg01291.php

This current problem was due to my assumption of how PG executes the
query:

My assumption was that the select would first do the joins (including
limit by class.id = 84) *then* weed out the duplicate class.ids.

But if PG is first doing the the joins on all the tables (before
limiting by class.id = 84) and then weeding out the duplicate
class.ids, and then finally limiting by class.id = 84 then I can see
where I might end up wit the missing row.

Frankly, I expected the first to happen because it would use an index
to select just the records of class.id = 84, then do the joins on
that small set of records.  Didn't seem likely that the database would
join all the records first and then limit by class.id.  Seems like the
hard way to do the query.  But the query planner works in strange and
mysterious ways. ;)

Does that also explain why PG was sometimes returning the "correct"
number of rows?  Depending on which of the two query plans above
were used?

> Exactly.  So your view is going to return the class id along with a
> randomly selected one of the instructor ids.  It seems to me that
> filtering this result on instructor id is perhaps a bit ill-advised,
> even if you fix the view so that the chosen instructor id isn't so
> random (eg, you could fix it to display the lowest-numbered instructor
> id for the particular class).  Even then, are you searching for the
> instructor id that the view happens to show for that class, or some
> other one?

Well, clearly, my "one-size-fits-all view" doesn't work in this case.
I just need another view without distinct when limiting by instructor.

It was that red-herring of removing a seemingly random column from the
view that made it hard to see what was really happening.

Thanks very much for all your time.

-- 
Bill Moseley
moseley@xxxxxxxx

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not
       match