Re: Values larger than 1/3 of a buffer page cannot be indexed (hstore)

Stefan Keller <sfkeller@xxxxxxxxx> · Sun, 1 May 2011 16:18:05 +0200

Hi Tom,

Thanks for the hint!  I'm actually doing a GROUP BY with an inquality
search (HAVING...) and still get long lasting queries, see query (2)
below.

I'm doing an equality search with success with the '->' operator on
the same field 'tags' like in this query (1):

-- Count all restaurants in database ("amenity = restaurant''):
select count(*) from osm_all_v
where hstore(tags)->'amenity'='restaurant'

This query 1 is reasonably fast and could be accelerated using this
functional index:
CREATE INDEX planet_osm_point_tags_restaurant
  ON planet_osm_point
  USING btree (tags)
  WHERE (tags -> 'amenity'::text) = 'restaurant'::text;

But look at query 2:

This GROUP BY query gives following results (after a VACUUM FULL ANALYZE):
* 26265ms without any index on field 'tags'
* 26000ms with GIN index on field 'tags'
* 26078ms with HASH index on field 'tags'

To sum up:
* GIN and HASH indexes don't make a difference.
* BTree index fails (as reported) on field 'tags' with ERROR 'index
row size 3120 exceeds maximum 2712' SQL state: 54000).
* GIST index fails too on field 'tags' but with ERROR 'invalid hstore
value found', SQL state: XX000.

Q:
=> The latter indicates an "invalid value" - but does not say what's invalid.
=> I have no clue how to boost query 2...? See below the query and the
explain output.

Yours, Stefan

QUERY 2
=======
-- Return all key-values of semantic type 'enum' without types
numeric, date/time etc.
-- (actually hstore handles all tag/values as of type text):
SELECT MIN(keys.key), hstore(p.tags)->keys.key, count(*)
FROM planet_osm_point p, (
  SELECT key, count(*)
  FROM (
    SELECT (each(tags)).key FROM planet_osm_point
  ) AS stat
GROUP BY key HAVING count(*) > 1
  AND key NOT LIKE 'note:%'
  ...
  AND key NOT IN
('ele','ref','url','website','email','maxspeed','converted_by', ... )
) keys
WHERE hstore(p.tags)->keys.key >''
GROUP BY hstore(p.tags)->keys.key HAVING count(*) > 1
ORDER by 1,3 desc
...
"amenity";"bus_stop";24414
"amenity";"restaurant";5423
"amenity";"bench";5041
"amenity";"parking";4232
"amenity";"fire_hydrant";2363
"amenity";"post_box";1838
"amenity";"fuel";1628
"amenity";"place_of_worship";1615
...

EXPLAIN output:
"Sort  (cost=6043108.58..6054613.93 rows=4602142 width=118) (actual
time=331415.987..331417.914 rows=964 loops=1)"
"  Sort Key: (min(keys.key)), (count(*))"
"  Sort Method:  quicksort  Memory: 85kB"
"  ->  GroupAggregate  (cost=4173505.61..4401220.42 rows=4602142
width=118) (actual time=330101.581..331409.834 rows=964 loops=1)"
"        Filter: (count(*) > 1)"
"        ->  Sort  (cost=4173505.61..4198338.94 rows=9933329
width=118) (actual time=330101.487..330870.237 rows=196227 loops=1)"
"              Sort Key: ((p.tags -> keys.key))"
"              Sort Method:  external merge  Disk: 21960kB"
"              ->  Nested Loop  (cost=22534.58..574521.77 rows=9933329
width=118) (actual time=6747.314..327192.811 rows=196227 loops=1)"
"                    Join Filter: ((p.tags -> keys.key) > ''::text)"
"                    ->  Seq Scan on planet_osm_point p
(cost=0.00..5649.22 rows=182822 width=86) (actual time=0.015..516.191
rows=182822 loops=1)"
"                    ->  Materialize  (cost=22534.58..22539.88
rows=163 width=32) (actual time=0.039..0.890 rows=420 loops=182822)"
"                          ->  Subquery Scan on keys
(cost=22534.58..22539.07 rows=163 width=32) (actual
time=6740.620..6743.358 rows=420 loops=1)"
"                                ->  HashAggregate
(cost=22534.58..22537.44 rows=163 width=32) (actual
time=6740.613..6741.651 rows=420 loops=1)"
"                                      Filter: (count(*) > 1)"
"                                      ->  Subquery Scan on stat
(cost=0.00..21417.62 rows=148929 width=32) (actual
time=0.093..6230.640 rows=196458 loops=1)"
"                                            Filter: ((stat.key !~~
'name%'::text) AND (stat.key !~~ 'addr:%'::text) AND (stat.key !~~*
'FIXME'::text) AND (stat.key !~~* 'openGeoDB:%'::text) AND (stat.key
!~~ 'note:%'::text) AND (stat.key !~~ '%_ref'::text) AND (stat.key !~~
'%description%'::text) AND (stat.key !~~ 'contact:%'::text) AND
(stat.key !~~ 'operator:%'::text) AND (stat.key !~~ 'uic_%'::text) AND
(stat.key !~~ 'TMC%'::text) AND (stat.key !~~ 'uic_%'::text) AND
(stat.key !~~ 'direction%'::text) AND (stat.key !~~ 'is_in%'::text)
AND (stat.key !~~ 'wikipedia%'::text) AND (stat.key <> ALL
('{ele,ref,url,website,email,maxspeed,converted_by,layer,level,phone,mobility:station_id,information,opening_hours,date,time,collection_times,operator,colour,fee,nat_name,alt_name,population,seats,postal_code,capacity,line,lines,maxheight,reg_name}'::text[])))"
"                                            ->  Seq Scan on
planet_osm_point  (cost=0.00..6106.28 rows=182822 width=86) (actual
time=0.025..2363.603 rows=575700 loops=1)"
"Total runtime: 331426.018 ms"

2011/5/1 Tom Lane <tgl@xxxxxxxxxxxxx>:
> Stefan Keller <sfkeller@xxxxxxxxx> writes:
>> Any ideas on how to index my hstore attribute?
>
> Use a GIST or GIN index. ÂThe only thing that a btree index on hstore
> can do for you is to support equality comparisons on the whole hstore
> value, which is pretty unlikely to be what you're after.
>
> Â Â Â Â Â Â Â Â Â Â Â Âregards, tom lane
>

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general