INDEX Performance Issue

Mark Davidson <mark@xxxxxxxxxxx> · Fri, 5 Apr 2013 16:51:37 +0100

Hi All, 

Hoping someone can help me out with some performance issues I'm having with the INDEX on my database. I've got a database that has a data table containing ~55,000,000 rows which have point data and an area table containing ~3,500 rows which have polygon data. A user queries the data by selecting what areas they want to view and using some other filters such as datatime and what datasets they want to query. This all works fine and previously the intersect of the data rows to the areas was being done on the fly with PostGIS ST_Intersects. However as the data table grow we decided it would make sense to offload the data processing and not calculate the intersect for a row on the fly each time, but to pre-calculate it and store the result in the join table. Resultantly this produce a table data_area which contains ~250,000,000 rows. This simply has two columns which show the intersect between data and area. We where expecting that this would give a significant performance improvement to query time, but the query seems to take a very long time to analyse the INDEX as part of the query. I'm thinking there must be something wrong with my setup or the query its self as I'm sure postgres will perform better. 

I've tried restructuring the query, changing config settings and doing maintenance like VACUUM but nothing has helped. 

Hope that introduction is clear enough and makes sense if anything is unclear please let me know.

I'm using PostgreSQL 9.1.4 on x86_64-unknown-linux-gnu, compiled by gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5.1) 4.4.3, 64-bit on Ubuntu 12.04 which was installed using apt. 

Here is the structure of my database tables 

CREATE TABLE data
(
  id bigserial NOT NULL,
  datasetid integer NOT NULL,
  readingdatetime timestamp without time zone NOT NULL,
  depth double precision NOT NULL,
  readingdatetime2 timestamp without time zone,

  depth2 double precision,
  value double precision NOT NULL,
  uploaddatetime timestamp without time zone,
  description character varying(255),
  point geometry,
  point2 geometry,
  CONSTRAINT "DATAPRIMARYKEY" PRIMARY KEY (id ),

  CONSTRAINT enforce_dims_point CHECK (st_ndims(point) = 2),
  CONSTRAINT enforce_dims_point2 CHECK (st_ndims(point2) = 2),
  CONSTRAINT enforce_geotype_point CHECK (geometrytype(point) = 'POINT'::text OR point IS NULL),

  CONSTRAINT enforce_geotype_point2 CHECK (geometrytype(point2) = 'POINT'::text OR point2 IS NULL),
  CONSTRAINT enforce_srid_point CHECK (st_srid(point) = 4326),
  CONSTRAINT enforce_srid_point2 CHECK (st_srid(point2) = 4326)

);

CREATE INDEX data_datasetid_index ON data USING btree (datasetid );
CREATE INDEX data_point_index ON data USING gist (point );
CREATE INDEX "data_readingDatetime_index" ON data USING btree (readingdatetime );

ALTER TABLE data CLUSTER ON "data_readingDatetime_index";

CREATE TABLE area
(
  id serial NOT NULL,
  "areaCode" character varying(10) NOT NULL,
  country character varying(250) NOT NULL,

  "polysetID" integer NOT NULL,
  polygon geometry,
  CONSTRAINT area_primary_key PRIMARY KEY (id ),
  CONSTRAINT polyset_foreign_key FOREIGN KEY ("polysetID")
      REFERENCES polyset (id) MATCH SIMPLE

      ON UPDATE CASCADE ON DELETE CASCADE,
  CONSTRAINT enforce_dims_area CHECK (st_ndims(polygon) = 2),
  CONSTRAINT enforce_geotype_area CHECK (geometrytype(polygon) = 'POLYGON'::text OR polygon IS NULL),

  CONSTRAINT enforce_srid_area CHECK (st_srid(polygon) = 4326)
);

CREATE INDEX area_polygon_index ON area USING gist (polygon );
CREATE INDEX "area_polysetID_index" ON area USING btree ("polysetID" );

ALTER TABLE area CLUSTER ON "area_polysetID_index";

CREATE TABLE data_area
(
  data_id integer NOT NULL,
  area_id integer NOT NULL,
  CONSTRAINT data_area_pkey PRIMARY KEY (data_id , area_id ),

  CONSTRAINT data_area_area_id_fk FOREIGN KEY (area_id)
      REFERENCES area (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT data_area_data_id_fk FOREIGN KEY (data_id)
      REFERENCES data (id) MATCH SIMPLE

      ON UPDATE CASCADE ON DELETE CASCADE
);

Here is the query I'm running and the result of its explain can be found here http://explain.depesz.com/s/1yu

SELECT * FROM data d JOIN data_area da ON da.data_id = d.id LEFT JOIN area a ON da.area_id = a.id WHERE d.datasetid IN (5634,5635,5636,5637,5638,5639,5640,5641,5642)   AND da.area_id IN

(1, 2, 3 .... 9999) AND (readingdatetime BETWEEN '1990-01-01' AND '2013-01-01') AND depth BETWEEN 0 AND 99999;

If you look at the explain the index scan is taking 97% of the time is spent on the index scan for the JOIN of data_area. 

Hardware

 - CPU: Intel(R) Xeon(R) CPU E5420 ( 8 Cores )
 - RAM: 16GB

Config Changes

I'm using the base Ubuntu config apart from the following changes

 - shared_buffers set to 2GB
 - work_mem set to 1GB
 - maintenance_work_men set to 512MB
 - effective_cache_size set to 8GB

Think that covers everything hope this has enough detail for someone to be able to help if there is anything I've missed please let me know and I'll add any more info needed. Any input on the optimisation of the table structure, the query, or anything else I can do to sort this issue would be most appreciated. 

Thanks in advance, 

Mark Davidson