Slow sequential scans on one DB but not another; fragmentation?

Stephen Harris <lists@xxxxxxxxxx> · Wed, 28 Mar 2007 09:55:35 -0400

This 8.0.8 on Solaris 8.  Yes I know; old technologies but we can't 
upgrade yet.

I have two identical servers.  One is production and overnight we do
a complete dump and scp the results to the other.  The other is standby.
It's currently running data that's about 7 months old because we haven't
needed to fail over yet.

I have one specific table ("sweep_users") that has 900,000 rows in prod
and 630,000 on the standby.  On the standby a "select count(*) from
sweep_users" takes a couple of seconds.  On production it takes... 240
seconds!

Data updates on this table consist of the following meta-logic:

  for host in list_of_hosts
    delete from sweep_users where hostid=host
    for user in users_for_host
      insert into sweep_users ....
  vacuum analyze sweep_users

I'm at a loss to understand why the production server is so slow.  While
the query is running "iostat -x" returns values like (on a striped mirror):

  device       r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
  md30       764.0    0.0 92110.5    0.0  0.0  1.3    1.6   0  87 
  md30      1666.8    0.0 67254.3    0.0  0.0  1.7    1.0   0  92 
  md30       844.6    0.4 75716.1    0.3  0.0  1.3    1.6   0  90 

The disk on the standby machine only shows 1/20th of that activity.

Now there is a difference in tuning between these values where we tried
to allocate more memory to the database to make queries more likely to be
in RAM (but we could have made a mistake)...

  % diff prod standby
  <  effective_cache_size           | 262144
  ---
  >  effective_cache_size           | 1000
  92c92
  <  maintenance_work_mem           | 524288
  ---
  >  maintenance_work_mem           | 16384
  106c106
  <  random_page_cost               | 1.5
  ---
  >  random_page_cost               | 4
  113c113
  <  shared_buffers                 | 30000
  ---
  >  shared_buffers                 | 1000
  141c141
  <  work_mem                       | 20480
  ---
  >  work_mem                       | 1024

The only idea I have is that our update pattern is somehow causing excessive
fragmentation, either at the DB level or the OS file level.

Anyone else have any thoughts?

-- 

rgds
Stephen