On 01/06/2015 10:09 AM, Michael Heaney wrote:
I'm fairly new to Postgres, and have a design issue for which an
array of integers might be a good solution. But I'd like to hear
from the experts before proceeding down this path.
Essentially, I'm trying to model the relationship between a group
of biological samples and their genes. Each sample (there are ~10K
of them now, with more coming) will have about 30,000 genes.
Conversely, a particular gene may be present in almost all
samples.
So I've created the following table to handle the many-to-many
relationship:
create table sample_gene (id serial, sample_id int, gene_id int);
which looks like this when populated:
sample_id | gene_id
---------------------------------------
1 | 1
1 | 2
...
1 | 30475
2 | 1
2 | 2
...
2 | 29973
3 | 1
etc.
The table now contains hundreds of millions of rows (with many,
many more to come). Join performance between samples and genes is
quite slow, even with indexes on sample_id and gene_id.
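For reference, a minimal sketch of the junction-table setup described above, with the two indexes mentioned (index names are illustrative, and whether a composite index would serve better depends on the query patterns):

```sql
create table sample_gene (
    id        serial primary key,
    sample_id int not null,
    gene_id   int not null
);

-- indexes to support joins in both directions
create index sample_gene_sample_idx on sample_gene (sample_id);
create index sample_gene_gene_idx   on sample_gene (gene_id);

-- a composite unique index can replace the single-column index on
-- sample_id and also guard against duplicate (sample, gene) pairs:
-- create unique index sample_gene_uq on sample_gene (sample_id, gene_id);
```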
So it occurred to me: why not eliminate all the duplicate
sample_id values by storing the gene_id's in an array, like so:
create table sample_gene_array (id serial, sample_id int, gene_id int[]);
So now the table data looks like this:
sample_id | gene_id []
---------------------------------------
1 | [1:30475]
2 | [1:29973]
etc.
The new table is significantly smaller, and performance (using
= ANY(...) ) is quite good. Nevertheless, I'm uneasy. I come from a
Sybase ASE background, and so have no experience with arrays as
datatypes. Is it okay to store 30K+ gene values in an array in the
linking table (or maybe even in the sample table itself, thus
eliminating the linking table)? Should I unnest the gene_id's
first, before using them to join to the gene table?
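To make the two query styles concrete, here is a sketch of both forms, assuming the `sample_gene_array` table above and a `gene` table with an integer `id` column (the literal values are placeholders):

```sql
-- membership test directly against the array column:
select s.sample_id
from   sample_gene_array s
where  1234 = any (s.gene_id);

-- unnesting first, then joining to the gene table:
select s.sample_id, g.*
from   sample_gene_array s
       cross join lateral unnest(s.gene_id) as u(gene_id)
       join gene g on g.id = u.gene_id
where  s.sample_id = 1;
```

One caveat worth knowing: `x = ANY(array_column)` cannot use a btree index on the array, but a GIN index on the array column supports the containment form `gene_id @> array[1234]`, which can matter at this row count.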
TIA for any guidance you can provide. Again, I'm a Postgres
neophyte - but I'm in awe of the power and flexibility of this
database, and wish that I'd started using it sooner.
------
Michael Heaney
JCVI
Think
genotype calls (sample, marker, call) and this just explodes in
terms of number of rows, but I'm trying not to let that bother
me. Partition by chromosome, or chromosome arm (even if the
partition is just separate tables), and the magnitude becomes
manageable.
At least as manageable as multi-gigabyte gvcf files per sample
from GATK.
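For what the partitioning suggestion might look like in practice: in the Postgres versions current at the time, partitioning meant table inheritance with CHECK constraints. A minimal sketch, assuming a (sample, marker, call) layout like the one mentioned (all names illustrative):

```sql
create table genotype_call (
    sample_id  int  not null,
    chromosome text not null,
    marker_id  int  not null,
    call       text
);

-- one child table per chromosome; the CHECK constraint lets the
-- planner skip irrelevant partitions via constraint exclusion
create table genotype_call_chr1 (
    check (chromosome = '1')
) inherits (genotype_call);

create table genotype_call_chr2 (
    check (chromosome = '2')
) inherits (genotype_call);

-- queries filtered on chromosome then touch only matching children:
-- set constraint_exclusion = partition;
-- select * from genotype_call where chromosome = '1' and sample_id = 42;
```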
But I'm a little confused: all samples for a given species will
have the same genes (roughly). Are you storing gene variants
(sequence or otherwise) per sample?