expected O^2 looks line K^O, index problem not involved: [was] looping simpler query just faster

Ivan Sergio Borgonovo <mail@xxxxxxxxxxxxxxx> · Thu, 10 Jul 2008 14:19:30 +0200

On Thu, 10 Jul 2008 11:50:01 +0200
Martijn van Oosterhout <kleptog@xxxxxxxxx> wrote:

> On Thu, Jul 10, 2008 at 11:40:40AM +0200, Ivan Sergio Borgonovo
> wrote:
> > I've this:
> 
> What's basically killing you is this condition:
> >   select i2.ItemID from catalog_items i2
> >     inner join catalog_brands b2 on
> > upper(i2.brands)=upper(b2.name) where i1.brands=i2.brands
> > <********* and i2.dataPub>(now() - interval '8 month') and
> 
> Is not indexable. Hence the seqscan, which makes everything slow.
> In your "faster" version you test against a condition which *is*
> indexable, hence it's faster.

I changed to 

where upper(i1.brands)=upper(i2.brands)

"Nested Loop  (cost=0.00..1393962681.78 rows=74378 width=82)"
"  ->  Seq Scan on catalog_brands b1  (cost=0.00..1.44 rows=44 width=18)"
"  ->  Index Scan using catalog_items_brands_index on catalog_items i1  (cost=0.00..31680940.43 rows=1690 width=82)"
"        Index Cond: (upper((i1.brands)::text) = upper(("outer".name)::text))"
"        Filter: (subplan)"
"        SubPlan"
"          ->  Limit  (cost=9366.40..9366.41 rows=1 width=16)"
"                ->  Sort  (cost=9366.40..9366.41 rows=1 width=16)"
"                      Sort Key: i2.datainserimento"
"                      ->  Nested Loop  (cost=29.84..9366.39 rows=1 width=16)"
"                            ->  Bitmap Heap Scan on catalog_items i2  (cost=29.84..9364.61 rows=1 width=34)"
"                                  Recheck Cond: (upper(($0)::text) = upper((brands)::text))"
"                                  Filter: ((datapub > (now() - '8 mons'::interval)) AND (datainserimento > (now() - '6 mons'::interval)))"
"                                  ->  Bitmap Index Scan on catalog_items_brands_index  (cost=0.00..29.84 rows=3381 width=0)"
"                                        Index Cond: (upper(($0)::text) = upper((brands)::text))"
"                            ->  Seq Scan on catalog_brands b2  (cost=0.00..1.77 rows=1 width=18)"
"                                  Filter: (upper(($0)::text) = upper((name)::text))"

but it still perform badly.
Skipping one of the two
join catalog_brands b1 on upper(i1.brands)=upper(b1.name)
doesn't improve anything...
even skipping some conditions, that I thought would actually make
the query faster, restricting the rows to sort etc... 

and i2.dataPub>(now() - interval '8 month') and
i2.datainserimento>(now() - interval '6 month')

didn't improve the speed.

And the sum of times it takes to execute the simpler statement for
each brands even without waiting the end of the above statements is
at least 1 order of magnitude faster than the more complicated query.

catalog_brands is a quite small table so
->  Seq Scan on catalog_brands b2 (cost=0.00..1.77 rows=1 width=18)"
 Filter: (upper(($0)::text) = upper((name)::text))"

shouldn't be a problem

and it seems that even the index is not playing such a big part

since this that doesn't use the index
select name, brands from catalog_items where brands='CAMBRIDGE
UNIVERSITY PRESS' order by datainserimento desc limit 3

takes less than 1 sec.

I'd say that having 44 groups and since the largest takes always
less then 1 sec with the simpler query... there should be something
else wrong with the above query that takes > 3 min.

Infact:
create temp table groupeditor as select i1.ItemID, i1.brands, i1.name, i1.dataPub, i1.datainserimento
  from catalog_items i1
  inner join catalog_brands b1 on upper(i1.brands)=upper(b1.name);
create index groupeditor_brands_idx on groupeditor (brands);
create index groupeditor_ItemID_idx on groupeditor (ItemID);

explain select i1.brands, i1.name, i1.dataPub, i1.datainserimento
  from groupeditor i1 where
  i1.ItemID in (
  select i2.ItemID from groupeditor i2
    where
    i1.brands=i2.brands
    and i2.dataPub>(now() - interval '8 month') and i2.datainserimento>(now() - interval '6 month')
    order by i2.datainserimento desc
  limit 3);

"Seq Scan on groupeditor i1  (cost=0.00..197133363.99 rows=68583 width=1048)"
"  Filter: (subplan)"
"  SubPlan"
"    ->  Limit  (cost=1437.15..1437.16 rows=3 width=16)"
"          ->  Sort  (cost=1437.15..1437.34 rows=76 width=16)"
"                Sort Key: datainserimento"
"                ->  Bitmap Heap Scan on groupeditor i2  (cost=7.40..1434.78 rows=76 width=16)"
"                      Recheck Cond: (($0)::text = (brands)::text)"
"                      Filter: ((datapub > (now() - '8 mons'::interval)) AND (datainserimento > (now() - '6 mons'::interval)))"
"                      ->  Bitmap Index Scan on groupeditor_brands_idx  (cost=0.00..7.40 rows=686 width=0)"
"                            Index Cond: (($0)::text = (brands)::text)"

Creating the temp table takes up 3 sec, creating the indexes 3
sec... and the new query... still forever...

Killing
i2.dataPub>(now() - interval '8 month') and
i2.datainserimento>(now() - interval '6 month')
and moving it in the creation of the temp table made the above run
in... 5 sec roughly... what took most was table and indexes
creation... the query took 61ms.
I could follow a more scientific measurement method (cache etc...)
but still it looks pretty impressive...

even more... dropping the indexes on the temp table even after
restarting the server still make the whole process stay below 5 sec.

As soon as groupeditor get larger (increasing the considered
interval) the query get slower and slower... much more than O^2 and
surely still not as fast as it would be to use several simpler
statements.
It still looks like a good idea to create a temp table so I'll have
to sort over a smaller set... but still I'm puzzled.
The subquery technique still perform awfully compared to the sums of
times taken by simpler queries.

Debian stable, pg 8.1

BTW what's going to happen to the indexes related to a temp table?

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it