Re: Advice for using integer arrays?

Arthur Silva <arthurprs@xxxxxxxxx> · Tue, 6 Jan 2015 15:48:00 -0200

On Jan 6, 2015 3:12 PM, "Michael Heaney" <mheaney@xxxxxxxx> wrote:

>

> I'm fairly new to Postgres, and have a design issue for which an array of integers might be a good solution.  But I'd like to hear from the experts before proceeding down this path.

>

> Essentially, I'm trying to model the relationship between a group of biological samples and their genes. Each sample (there are ~10K of them now, with more coming) will have about 30,000 genes. Conversely, a particular gene may be present in almost all samples.

>

> So I've created the following table to handle the many-to-many relationship:

>

>  create table sample_gene (id serial, sample_id int, gene_id int);

>

> which looks like this when populated:

>

> sample_id    |    gene_id

> ---------------------------------------

> 1                 |      1

> 1                 |      2

> ...

> 1                 |    30475

> 2                 |     1

> 2                 |     2

> ...

> 2                 |    29973

> 3                 |      1

> etc.

>

> The table now contains hundreds of millions of rows (with many, many more to come).  Join performance between samples and genes is quite slow, even with indexes on sample_id and gene_id.

>

> So it occurred to me: why not eliminate all the duplicate sample_id values by storing the gene_id's in an array, like so:

>

>  create table sample_gene_array (id serial, sample_id int, gene_id int [] );

>

> So now the table data looks like this:

>

> sample_id    |    gene_id []

> ---------------------------------------

> 1                 |      [1:30475]

> 2                 |      [1:29973]

> etc.

>

> The new table is significantly smaller, and performance (using ANY[] ) is quite good.  Nevertheless, I'm uneasy.  I come from a Sybase ASE background, and so have no experience with arrays as datatypes. Is it okay to store 30K+ gene values in an array in the linking table (or maybe even in the sample table itself, thus eliminating the linking table)?  Should I unnest the gene_id's first, before using them to join to the gene table?

>

> TIA for any guidance you can provide.  Again, I'm a Postgres neophyte - but I'm in awe of the power and flexibility of this database, and wish that I'd started using it sooner.

>

> ------

> Michael Heaney

> JCVI

>

>

>

>

>

> -- 

> Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)

> To make changes to your subscription:

> http://www.postgresql.org/mailpref/pgsql-general
Please provide sample queries so we can understand how you query the data.