Using a GIN index on an integer array to model sets of tags

Mike Jarmy <mjarmy@xxxxxxxxx> · Sat, 17 Nov 2012 11:01:41 -0500

I am researching how to set up a schema for querying a set of tags
associated with an object.  There are approximately 100 distinct tags
in my application (these will be pre-populated), and I am expecting a
fairly low number of distinct sets of these tags  -- in other words, a
given set of tags will most likely be reused by many objects.  There
will be on the order of thousands to hundreds-of-thousands of records
in the 'objects' table.

I could use the standard many-to-many approach for this, that I've
used many times in the past:

CREATE TABLE TAGS (
    ID SMALLINT PRIMARY KEY,
    NAME VARCHAR);

CREATE TABLE OBJECTS (
    ID SERIAL PRIMARY KEY,
    FOO VARCHAR);

CREATE TABLE OBJECT_TAGS (
    OBJECT_ID INT,
    TAG_ID SMALLINT,
    PRIMARY KEY (OBJECT_ID, TAG_ID));

However, I've run across a couple of articles suggesting that I could
use an array field to store the tags on the object, and use a GIN
index to do set comparisons, e.g:
http://sjsnyder.com/using-postgresql-arrays-with-gin-indexes-for.
Though note that in that article they are storing each tag in the
'tags' field as text, whereas I think it would make more sense to
store the id of each tag in the 'tags' field, and cache the mapping
from tag names to tag ids in my application:

CREATE TABLE TAGS (
    ID SMALLINT PRIMARY KEY,
    NAME VARCHAR);

CREATE TABLE OBJECTS (
    ID SERIAL PRIMARY KEY,
    TAGS SMALLINT[],
    FOO VARCHAR);

CREATE INDEX OBJECTS_TAGS_INDEX ON OBJECTS USING GIN (TAGS);

I like this array-plus-gin-index approach because the object id is not
repeated over and over again as it is in the many-to-many link table.
It seems like it would be much more efficient.  Another advantage is
that the queries are so much more legible -- it just looks a lot
cleaner to me.

However, the PostgreSQL manual's chapter on arrays
(http://www.postgresql.org/docs/9.2/static/arrays.html) does not
mention GIN indexes.  Instead, it has a tip saying "Arrays are not
sets; searching for specific array elements can be a sign of database
misdesign." Of course the manual is correct if the array field is
unindexed, but I find the fact that the GIN-indexing approach is not
mentioned to be puzzling.

So, my question is: Is it in fact a good idea to use the
integer-array-plus-GIN-index approach as a way to store sets in-line?
Does anyone actually do this in production? Or is the old-school
many-to-many relationship still the way to go?  Of course I'm going to
try both approaches out myself and see how it works, but I'm
interested in the communities opinions on this subject (particularly
since I'm new to PostgreSQL, having mostly used commercial databases
in the past).

Thanks, Mike Jarmy

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general