Re: Finding common hstore key=>value pairs with hstore

Paul Norman <penorman@xxxxxxx> · Sun, 27 Jan 2013 04:18:28 -0800

> From: pgsql-general-owner@xxxxxxxxxxxxxx [mailto:pgsql-general-
> owner@xxxxxxxxxxxxxx] On Behalf Of John R Pierce
> Subject: Re:  Finding common hstore key=>value pairs with
> hstore
> 
> On 1/27/2013 3:09 AM, Paul Norman wrote:
> > I am in a situation where I have two tables, a and b, each with a
> > hstore column called tags. Both tags columns have a GIN index on them.
> > I want to find rows of a and b where the both have a particular hstore
> > key and that key is the same.
> >
> > One way to do this would be SELECT * FROM a JOIN b ON a.tags -> 'foo'
> > = b.tags -> 'foo'; This would not use the indexes.
> >
> > I would like some way that makes use of the two indexes. ON a.tags @>
> > hstore('foo', b.tags -> 'foo') would be better, making use of the
> > a.tags index, but not the b.tags one. Is there any way to use both?
> >
> > I suppose the other problem is the hstore statistics gathering is
> > questionable enough that the query plan it comes up with could be
> > absolutely abysmal. This is of course not unique to this particular
> > problem but is a common problem with hstore columns
> 
> Seems to me like you should be using a separate field for this 'foo'
> defined as a foreign key.

I am cross-referencing two independent data sources. I had desired to get
into the full complexities of the data schema, but I'll summarize the
important points, leaving aside some of the postgis matter that doesn't
directly impact on the hstore issue. I'll also keep calling it foo to avoid
escaping the tag name, and because the name doesn't really matter.

I actually have three tables, but I'll just explain for one because the
others differ only in the nature of the geometries stored in them. All three
are kept updated with continually changing data though a complicated process

I have one table, "nodes", with OpenStreetMap node data. It has 3 columns
that matter for this, id, tags and geom. id is a bigint and primary key,
tags is a hstore and geom is a postgis geometry in EPSG:4326. I have a btree
index on id, gist on (geom,tags) and gin on tags. The table is 175GB, 1.75B
rows and the indexes are 39GB, 154GB and 24GB. I also have some other
indexes on other columns and some for testing purposes.

For the tags column, about 90% are empty hstores and about 1% have the key I
am interested in. There are approximately 370k unique values for this key.
For this data there is nothing special about the key I am interested in, it
is simply one of many.

The other table, import_foo is a table created by the software I am writing.
It also has three columns that matter, id, tags and geom. id is an int, but
otherwise the columns are the same type. The id from this table bears
absolutely no relation to the id from the nodes table. For my test dataset
this table is about 100k rows.

I am only interested in rows in the nodes table so all queries against this
table use ST_Intersects to filter to the region of interest. There are on
the order of 500k rows in the test region of which about 82k have the foo
key. 

The first thing my software does is delete rows from import_foo that it can
exactly match against with both the foo key and the bar key (another key,
similar statistics as foo) against the OSM data, as well as delete the
corresponding objects from its memory. This reduces import_foo down to
approximately 2k rows. This step is the first one where the situation in my
original message arises. It should be noted that not all rows of import_foo
will have bar keys.

The "easy" rows being dealt with, my program then goes on to consider more
complex matches on the remaining 2k rows. Here it matches them on the foo
key and the position (i.e. they are within N meters of each other), ignoring
the bar key. There may be multiple rows from nodes matching the same row
from import_foo. The behavior if multiple rows from import_foo match the
same row from nodes is currently undefined in the application, but it is
unlikely this particular combination will come up. This is another step that
the situation in my original message arises.

Given that there is nothing special about foo in nodes and that the two data
sources are entirely independent, I don't see that it makes sense to
consider a foreign key here.

This was probably excessively long, but explains my use case in moderate
detail. 

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general