Re: Large rows number, and large objects

Craig James <craig_james@xxxxxxxxxxxxxx> · Sun, 19 Jun 2011 08:49:28 -0700



    On 6/19/11 4:37 AM, Samuel Gendler wrote:

    
      On Sat, Jun 18, 2011 at 9:06 PM, Jose
        Ildefonso Camargo Tolosa <ildefonso.camargo@xxxxxxxxx>
        wrote:

        
          Greetings,

          
          I have been thinking a lot about pgsql performance when it is
          dealing

          with tables with lots of rows on one table (several millions,
          maybe

          thousands of millions).  Say, the Large Object use case:

          
          one table has large objects (have a pointer to one object).

          The large object table stores the large object in 2000 bytes
          chunks

          (iirc), so, if we have something like 1TB of data stored in
          large

          objects, the large objects table would have something like
          550M rows,

          if we get to 8TB, we will have 4400M rows (or so).

          
          I have read at several places that huge tables should be
          partitioned,

          to improve performance.... now, my first question comes: does
          the

          large objects system automatically partitions itself? if no:
          will

          Large Objects system performance degrade as we add more data?
          (I guess

          it would).

        
    You should consider "partitioning" your data in a different way:
    Separate the relational/searchable data from the bulk data that is
    merely being stored.

    
    Relational databases are just that: relational.  The thing they do
    well is to store relationships between various objects, and they are
    very good at finding objects using relational queries and logical
    operators.

    
    But when it comes to storing bulk data, a relational database is no
    better than a file system.

    
    In our system, each "object" is represented by a big text object of
    a few kilobytes.  Searching that text file is essential useless --
    the only reason it's there is for visualization and to pass on to
    other applications.  So it's separated out into its own table, which
    only has the text record and a primary key.

    
    We then use other tables to hold extracted fields and computed data
    about the primary object, and the relationships between the
    objects.  That means we've effectively "partitioned" our data into
    searchable relational data and non-searchable bulk data.  The result
    is that we have around 50 GB of bulk data that's never searched, and
    about 1GB of relational, searchable data in a half-dozen other
    tables.

    
    With this approach, there's no need for table partitioning, and full
    table scans are quite reasonable.

    
    Craig