Re: select subquery versus join subquery

"Gunnar \"Nick\" Bluth" <gunnar.bluth.extern@xxxxxxxxx> · Tue, 23 May 2017 13:03:37 +0200

Am 05/22/2017 um 09:57 PM schrieb Jeff Janes:
> I need to make a view which decorates rows from a parent table with
> aggregated values from a child table.  I can think of two ways to write
> it, one which aggregates the child table to make a new query table and
> joins the parent to that, as shown in "view1" below.  Or does subselect
> in the select list to aggregate just the currently matching rows, and
> returns that value, as in "view2" below.
> 
> While these two are semantically equivalent, the planner doesn't
> understand that, and always executes them pretty much the way you would
> naively do it based on the text of the query.
> 
> But view1 is pretty slow if the WHERE clause is highly selective (like
> "WHERE cutoff<0.00001") because it has to summarize the entire child
> table just to pull out a few rows.  But view2 is pretty slow if the
> entire view or most of it (like "WHERE cutoff<0.9") is being returned.
> 
> Is there some 3rd way to write the query which allows the planner to
> switch between strategies (summarize whole table vs summarize values on
> demand) depending on the known selectivity of the where clause?
> 
> In this case, the planner is getting the relative cost estimates roughly
> correct.  It is not a problem of mis-estimation.
> 
> I can always create two views, view_small and view_large, and swap
> between them based on my own knowledge of how restrictive a query is
> likely to be, but that is rather annoying.  Especially in the real-world
> situation, which is quite a bit more complex than this.
> 
> create table thing as select x as id, random() as cutoff from
> generate_series(1,2000000) f(x);
> 
> create table thing_alias as select
> floor(power(random()*power(2000000,5),0.2))::int thing_id, md5(x::text),
> random() as priority from generate_series(1,150000) f(x);
> 
> create index on thing_alias (thing_id );
> 
> create index on thing (cutoff );
> 
> vacuum; analyze;
> 
> create view view1 as select id, md5,cutoff from thing left join 
>   (
>      select distinct on (thing_id) thing_id, md5 from thing_alias 
>      order by thing_id, priority desc
>   ) as foo 
>   on (thing_id=id);
> 
> create view view2 as select id, 
>   (
>      select md5 from thing_alias where thing_id=id 
>         order by priority desc limit 1
>   ) as md5, 
>   cutoff from thing;
> 
> Cheers,
> 
> Jeff

Hi Jeff,

how does something like

CREATE OR REPLACE VIEW public.view3 AS
 SELECT thing.id,
    foo.md5,
    thing.cutoff
   FROM thing,
    LATERAL ( SELECT DISTINCT ON (thing_alias.thing_id)
thing_alias.thing_id,
            thing_alias.md5
           FROM thing_alias
          WHERE thing_alias.thing_id = thing.id
          ORDER BY thing_alias.thing_id, thing_alias.priority DESC) foo

work for you? At least that's always using an index scan here, as
opposed to view1, which (for me) defaults to a SeqScan on thing_alias at
a low cutoff.

*****
Note btw. that both view1 and view2 don't return any md5 values for me,
while view3 does!
*****

Results (ms, median of 3 runs):
cutoff<  0.1   0.9
view1:   348   1022
view2:   844   6484
view3:   842   5976

With

 LATERAL ( SELECT string_agg(thing_alias.md5, ','::text) AS md5
           FROM thing_alias
          WHERE thing_alias.thing_id = thing.id
          GROUP BY thing_alias.thing_id) foo

(which seems to make more sense ;-)

I yield 483 (0.1) and 3410 (0.9) ms (and return md5-Aggregates).

Cheers,
-- 
Gunnar "Nick" Bluth
DBA ELSTER

Tel:   +49 911/991-4665
Mobil: +49 172/8853339

Attachment:
smime.p7s

Description: S/MIME Cryptographic Signature