Re: Two billion records ok?

Nick Bower wrote:

> We're considering using PostgreSQL for storing gridded metadata - each point of our grids has a variety of metadata attached to it (including lat/lon, measurements, etc.) and would constitute a record in PostgreSQL + PostGIS.
>
> Size-wise, grids are about 4000x700 and are collected twice daily over, say, 10 years. As mentioned, each record would have up to 50 metadata attributes (columns), including geom, floats, varchars, etc.
>
> So given 4000x700x2x365x10 > 2 billion, is this going to be a problem if we want to query on datetimes, PostGIS lat/lon, and integer-based metadata flags?

Hmmmm... 2 billion looks optimistic: 4000 x 700 x 2 x 365 x 10 = 2.044e+10, which is about 20 billion, not 2.

I'm currently working with a "table" of over 200,000,000 records. With a clustered index and underlying partitioning, response times are more than acceptable: a 25-wide self relation (left outer join) over 3 months of data (records arrive at around 40/minute) takes about 2 minutes, and a simple query with a WHERE clause on timestamp takes a few seconds at most. This is on a 32-bit Intel system with only 2 GB of memory and mirrored 7200 RPM SATA hard drives.
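For concreteness, that "simple query" is just a timestamp range scan. A minimal sketch, with hypothetical names ("obs" and "obs_time" stand in for the real table and column):

  -- 3 months of data; with the clustered timestamp index, this kind
  -- of range scan is what returns in a few seconds on the 200M rows.
  SELECT *
  FROM   obs
  WHERE  obs_time >= '2007-01-01'
    AND  obs_time <  '2007-04-01';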

I'd suggest partitioning on timestamp, per year at least, and a clustered index on timestamp within each partition; see the sketch below. It should be viable if your hardware is up to it.
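Roughly like this, using inheritance-based partitioning (a sketch only: table and column names are assumptions, and the real table would carry the ~50 metadata columns; on releases before 8.3 the CLUSTER syntax is "CLUSTER index ON table" instead):

  -- Parent table: one row per grid point per acquisition.
  CREATE TABLE grid_obs (
      obs_time  timestamp NOT NULL,
      geom      geometry,     -- PostGIS point (lat/lon)
      qc_flag   integer
      -- ... remaining metadata columns ...
  );

  -- One child table per year; the CHECK constraint lets the planner
  -- skip irrelevant years when constraint_exclusion is enabled.
  CREATE TABLE grid_obs_2006 (
      CHECK (obs_time >= '2006-01-01' AND obs_time < '2007-01-01')
  ) INHERITS (grid_obs);

  CREATE INDEX grid_obs_2006_time_idx ON grid_obs_2006 (obs_time);

  -- Physically order the partition by timestamp (re-run after bulk loads).
  CLUSTER grid_obs_2006 USING grid_obs_2006_time_idx;

With that in place, a WHERE clause on obs_time only touches the relevant years' partitions, and clustering keeps rows with adjacent timestamps physically adjacent on disk.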

I'd also strongly recommend a suitable platform: 64-bit Linux on AMD64 or Opteron, with as much memory and the fastest RAID setup you can afford. Make sure you use a fully 64-bit build of PostgreSQL/PostGIS on this platform as well.

If the same grid is being resampled, then a separate table defining the grid, joined on a grid-cell ID to the main (partitioned) table, may improve performance (and reduce data volume). A sketch follows.
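Again hypothetical names; here a cell_id column replaces the per-row geom column from the sketch above:

  -- Store each cell's geometry once instead of repeating it for
  -- every acquisition (~7300 times over 10 years, twice daily).
  CREATE TABLE grid_cell (
      cell_id  integer PRIMARY KEY,
      geom     geometry      -- PostGIS point (or cell polygon)
  );

  -- grid_obs then carries cell_id rather than geom; queries that
  -- need the geometry join back to grid_cell:
  SELECT o.obs_time, o.qc_flag, c.geom
  FROM   grid_obs  o
  JOIN   grid_cell c ON c.cell_id = o.cell_id
  WHERE  o.obs_time >= '2006-01-01'
    AND  o.obs_time <  '2006-04-01';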

I wouldn't expect instantaneous answers from it, but I don't know of any reason it won't work. It depends very much on what level of performance is considered acceptable.

I'm also using PostGIS grids with up to a few million cells, and spatial joins to a million or so tracklines, to generate gridded models for analysis in R (roughly the query sketched below). You seem to be scaling this up from where I'm at, so I'd like to know how you get on.
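That sort of spatial join looks roughly like this (a sketch only; the trackline table and its columns are assumptions):

  -- Count tracklines crossing each grid cell, ready for export to R.
  -- A GiST index makes this scale:
  --   CREATE INDEX trackline_geom_idx ON trackline USING gist (geom);
  SELECT c.cell_id, count(t.track_id) AS n_tracks
  FROM   grid_cell c
  LEFT JOIN trackline t ON ST_Intersects(t.geom, c.geom)
  GROUP  BY c.cell_id;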


Cheers,

  Brent Wood


