From: Kevin Galligan [mailto:kgalligan@xxxxxxxxx]
Sent: Wednesday, October 29, 2008 3:16 PM
To: Dann Corbit
Subject: Re: Slow query performance
Columns are as follows:

account  | integer
city     | character varying(20)
zip      | character(5)
dincome  | character(1)
sex      | character(1)
mob      | boolean
religion | character(1)
groupcd  | character(1)
lastdata | character varying(4)
countycd | character varying(3)
state    | character varying(2)
dutype   | character varying(1)
orders   | integer
countysz | character varying(1)
ethnic   | character varying(2)
language | character varying(2)
cbsacode | character varying(5)
cbsatype | character varying(1)
age      | smallint
dob      | date
ccard    | boolean
lor      | integer
bankcard | boolean
lastord  | date
total    | integer
lmob     | boolean
homeown  | character varying(1)
ord1st   | date
ordlast  | date
married  | boolean
deptcard | boolean
>>
You did not show us the indexes.
If you have pgadmin III, go to the table and copy/paste the actual definition, including indexes.
<<
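(If it helps, rather than pgadmin I can pull the same thing from psql -- just a sketch, the database name here is made up:

psql -d mydb -c '\d bigdatatable'
select indexdef from pg_indexes where tablename = 'bigdatatable';

That should show the full table definition plus the index DDL.)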
From here it's about another 100 columns with either booleans or smallints, mostly null values.
I eventually killed the vacuum. I will run it again, but was just going
through the indexes. All were of this format...
"INFO: index "idx_jordlast" now contains 265658026 row
versions in 728409 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.88s/0.13u sec elapsed 90.38 sec."
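(Side note: while the vacuum runs, index sizes can also be read straight from the catalogs -- a rough sketch, assuming all the indexes are named idx_*:

select relname, pg_size_pretty(pg_relation_size(oid)) as size
from pg_class
where relkind = 'i' and relname like 'idx_%'
order by pg_relation_size(oid) desc;

At the default 8 kB page size, 728409 pages for idx_jordlast works out to roughly 5.5 GB, so the indexes alone are substantial.)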
An example of a slow query is...
select count(*) from bigdatatable where age between 22 and 40 and state = 'NY';
>>
Is there an index on state and age?
Is there an index on state?
Is there an index on age?
That is important missing information.
If there is no index on either column, then you will do a table scan.
If all of your slow queries look like the above, then create a clustered index on state, age.
<<
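(If I follow, that would be roughly the following -- the index name is made up, and as far as I know CLUSTER in postgres is a one-time physical reorder rather than a maintained clustered index, so it would need re-running if the data changed:

create index idx_state_age on bigdatatable (state, age);
cluster bigdatatable using idx_state_age;  -- older versions: cluster idx_state_age on bigdatatable;
analyze bigdatatable;

Since this data is read-only once loaded, that limitation shouldn't matter.)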
I know count is not optimized on postgresql like it is on mysql due to transaction isolation (at least that's what I've read; makes sense to me). I understand it'll take time to actually count the rows. However, here's the output of 'explain analyze select count(*) from bigdatatable where age between 22 and 40 and state = 'NY';':
Aggregate  (cost=5179639.55..5179639.56 rows=1 width=0) (actual time=389529.895..389529.897 rows=1 loops=1)
  ->  Bitmap Heap Scan on bigdatatable  (cost=285410.65..5172649.63 rows=2795968 width=0) (actual time=6727.848..387159.175 rows=2553273 loops=1)
        Recheck Cond: ((state)::text = 'NY'::text)
        Filter: ((age >= 22) AND (age <= 40))
        ->  Bitmap Index Scan on idx_jstate  (cost=0.00..284711.66 rows=15425370 width=0) (actual time=6298.950..6298.950 rows=16821828 loops=1)
              Index Cond: ((state)::text = 'NY'::text)
Total runtime: 389544.088 ms
It looks like the index scan itself is around 6 seconds or so, which is fine. Then the time jumps to over 6 minutes by the time the "Aggregate" step finishes (most of it apparently spent in the Bitmap Heap Scan).
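(On the count(*) point above: for a rough count of the whole table, I gather the planner's own estimate can be read from the catalogs, e.g.:

select reltuples::bigint as approx_rows from pg_class where relname = 'bigdatatable';

though obviously that doesn't help with a filtered count like the one here.)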
I know the database design is crude. It's really just a big flat table. I didn't put too much into weeding out which columns should be indexed and which shouldn't (just slapped an index on each). Happy to do that work, but right now I'm in panic mode and just need to figure out which way to start going. I had a design on mysql which worked pretty well at 10 to 20% of full size, but degraded quite a bit at full size. Compounding this, there is another implementation we've seen that uses the full size of similar data and returns actual results in seconds. (I originally planned to use a 5% size db for estimated results, then the full size for getting the actual data. This plan was rejected. :( )
Any thoughts? It seemed to work OK when I had a table with 10 cols but
about the same data length. That may have been an artificial test,
though.
Again: this is read-only once the data is set up. The client wants to run pretty much arbitrary queries, so it's hard to isolate certain things for optimization, although there are some "knowns".
Will start the full vacuum process again.
Thanks in advance,
-Kevin
On Wed, Oct 29, 2008 at 4:52 PM, Dann Corbit <DCorbit@xxxxxxxxx> wrote:
>> -----Original Message-----
>> From: pgsql-general-owner@xxxxxxxxxxxxxx
>> [mailto:pgsql-general-owner@xxxxxxxxxxxxxx] On Behalf Of Kevin Galligan
>> Sent: Wednesday, October 29, 2008 1:18 PM
>> To: pgsql-general@xxxxxxxxxxxxxx
>> Subject: Slow query performance
>>
>> I'm approaching the end of my rope here.  I have a large database.
>> 250 million rows (ish).  Each row has potentially about 500 pieces of
>> data, although most of the columns are sparsely populated.
>>
>> What I'm trying to do is, essentially, search for sub-sets of that
>> data based on arbitrary queries of those data columns.  The queries
>> would be relatively simple ("dirbtl is not null and qqrq between 20
>> and 40").  After the database is built, it is read only.
>>
>> So, I started with maybe 10-15 fields in a main table, as most records
>> have values for those fields.  Then had individual tables for the
>> other values.  The idea is that the percentage of rows with values
>> drops off significantly after those main tables.  That, and each
>> individual query looks at probably 3 or 4 fields in total.  The
>> performance of those queries was pretty bad.  It's got to join large
>> numbers of values, which didn't really work out well.
>>
>> So, I went the other direction completely.  I rebuilt the database with
>> a much larger main table.  Any values with 5% or greater filled-in
>> rows were added to this table.  Maybe 130 columns.  Indexes applied to
>> most of these.  Some limited testing with a smaller table seemed to
>> indicate that queries on a single table without a join would work much
>> faster.
>>
>> So, built that huge table.  Now query time is terrible.  Maybe a
>> minute or more for simple queries.
>>
>> I'm running vacuum/analyze right now (which is also taking forever,
>> BTW).
>>
>> The box has 15 GB of RAM.  I set shared_buffers to 8 or 9 GB.  My
>> first question: what would be better to bump up to increase the
>> performance?  I thought that was the field to jack up to improve
>> query time or index caching, but I've read conflicting data.  The
>> 15 GB of RAM is available.
>>
>> I originally had this in mysql.  Also bad performance.  I understand
>> how to optimize that much better, but it just wasn't cutting it.
>>
>> Anyway, help with tuning the settings would be greatly appreciated.
>> Advice on how best to lay this out would also be helpful (I know it's
>> difficult without serious detail).
>
> Show the full schema for your table and the output of:
> VACUUM VERBOSE ANALYZE <your_table>
>
> Show the query that is slow.
>
>