Re: Why is my database so big?

Stephen Frost <sfrost@xxxxxxxxxxx> · Mon, 22 Feb 2016 09:58:08 -0500

All,

* FarjadFarid(ChkNet) (farjad.farid@xxxxxxxxxxxxxxxxx) wrote:
> Tom, thanks for your unbiased detailed response. 
> 
> Interesting post. 

Please don't top-post.  My comments are in-line, below.

> -----Original Message-----
> From: pgsql-general-owner@xxxxxxxxxxxxxx
> [mailto:pgsql-general-owner@xxxxxxxxxxxxxx] On Behalf Of Tom Lane
> Sent: 22 February 2016 05:06
> To: Andrew Smith
> Cc: pgsql-general@xxxxxxxxxxxxxx
> Subject: Re:  Why is my database so big?
> 
> Andrew Smith <laconical@xxxxxxxxx> writes:
> > I am setting up a proof of concept database to store some historical data.
> > Whilst I've used PostgreSQL a bit in the past this is the first time 
> > I've looked into disk usage due to the amount of data that could 
> > potentially be stored. I've done a quick test and I'm a little 
> > confused as to why it is occupying so much space on disk. Here is my table
> definition:
> 
> > CREATE TABLE "TestSize"
> > (
> >   "Id" integer NOT NULL,
> >   "Time" timestamp without time zone NOT NULL,
> >   "Value" real NOT NULL,
> >   "Status" smallint NOT NULL,
> >   PRIMARY KEY ("Id", "Time")
> > );
> 
> > CREATE INDEX test_index ON "TestSize" ("Id");

Note that you don't really need an index on "Id" because including a
primary key will automatically include an index on those fields, and an
index over ("Id", "Time") can be used to satisfy queries which have a
conditional on just the "Id" column.  Removing that extra index will
likely help with space issues.

> > With a completely empty table the database is 7 MB. After I insert 1 
> > million records into the table the database is 121 MB. My 
> > understanding is that each of the fields is sized as follows:
> 
> > integer - 4 bytes
> > timestamp without time zone - 8 bytes
> > real - 4 bytes
> > smallint - 2 bytes

I'd recommend against using timestamp w/o time zone.  For starters, as
noted, it's not actually saving you any space over timestamp w/ time
zone, and second, it makes working with that field painful and prone to
error.

> The long and the short of it is that Postgres is more oriented to OLTP-style
> applications where access to and update of individual rows is the key
> performance metric.  If you're dealing with historical data and mainly want
> aggregated query results, it's possible you'd get better performance and
> more-compact storage from a column-store database.
> 
> There's ongoing investigation into extending Postgres to support
> column-style storage for better support of applications like that; but any
> such feature is probably several years away, and it will not come without
> performance compromises of its own.

One approach to dealing with the PG per-row overhead is to use arrays
instead of rows, when you don't need that per-row visibility
information.  I've found that to be very successful for single-column
tables, but the technique may work reasonably well for other structures
also.

Thanks!

Stephen
Attachment:
signature.asc

Description: Digital signature