Re: Atomic access to large arrays

Victor de Buen <vdebuen@xxxxxxxxx> · Wed, 22 Jul 2009 10:12:42 +0200

Hi

I'm storing historical meteorological gridded data from GFS (http://www.nco.ncep.noaa.gov/pmb/products/gfs/) into an array field in a table like this:

CREATE TABLE grid_f_data_i2 (
  //Specifies the variable and other features of data

  id_inventory integer REFERENCES grid_d_inventory(id_inventory), 
  //A new grid is available each 3 hours since 5 years ago
  dh_date timestamp,  
  //Values are scaled to be stored as signed integers of 2 bytes

  vl_grid smallint[361][720],   
CONSTRAINT meteo_f_gfs_tmp PRIMARY KEY 
  (co_inventory, dh_date)
);

Dimensions
of each value of field vl_grid are (lat:361 x lon:720 = 259920 cells}
for a grid of 0.5 degrees (about each 55 Km) around the world. So,
vl_grid[y][x] stores the value at dh_date of a meteorological variable
specified by id_inventory in the geodesic point 

latitude  = -90 + y*0.5
longitude = x*0.5

The reverse formula for the closest point in the grid of an arbitary geodesic point will be

y = Round((latitude+90) * 2

x = Round(longitude*2)

Field vl_grid is stored in the TOAST table and has a good
compression level. PostgreSql is the only one database that is able to
store this huge amount of data in only 34 GB of disk. It's really great
system. Queries returning big rectangular areas are very fast, but the
target of almost all queries is to get historical series for a geodesic
point

SELECT  dh_date, vl_grid[123][152]
FROM  grid_f_data_i2
WHERE  id_inventory = 6
ORDER BY dh_date

In this case, atomic access to just a cell of each one of a only few thousands of rows becomes too slow.

Using standar way, size increase very much
CREATE TABLE grid_f_data_i2 (

   id_inventory integer REFERENCES grid_d_inventory(id_inventory),
   dh_date timestamp,
   smallint lat,

   smallint lon,
   smallint value
};

This table have (4+8+2+2+2=24) bytes by register and (lat:361 x lon:720 = 259920) registers by grid, so, 6238080 bytes by grid.
Uncompressed array design uses 4+8+2*259920=519852 bytes by register but just one register by grid.

TOAST table compress these arrays with an average factor 1/2, so, the
total size with arrays is 24 times lesser than standard way.

Now,
I have more than 60000 stored grids in 30 GB, instead of 720 GB, but
probably I will store 1 million of grids or more in 0.5 TB instead of
12 TB.

I have no enougth money to buy nor maintain 12 TB disks.

Please, could somebody answer some of these questions?
It's posible to tune some TOAST parameters to get faster atomic access to large arrays?
Using "EXTERNAL" strategy for storing TOAST-able columns could solve the problem?     
Atomic access will be faster if table stores vectors for data in the same parallel instead of matrices of global data?

CREATE TABLE grid_f_data_i2 (

  //Specifies the variable and other features of data

  id_inventory integer REFERENCES grid_d_inventory(id_inventory), 
  //A new grid is available each 3 hours since 5 years ago

  dh_date timestamp,
  // nu_parallel = y = Round((latitude+90) * 2

  smallint nu_parallel,

  //Values are scaled to be stored as signed integers of 2 bytes

  vl_parallel smallint[],

CONSTRAINT meteo_f_gfs_tmp PRIMARY KEY 

  (co_inventory, nu_parallel, dh_date)

);

There is another faster solution?
Thanks in advance and best regards

-- 
Víctor de Buen Remiro
Tol Development Team member
www.tol-project.org