Re: vacuumdb question/problem

"Kevin Grittner" <Kevin.Grittner@xxxxxxxxxxxx> · Thu, 21 Jul 2011 10:55:48 -0500

David Ondrejik <David.Ondrejik@xxxxxxxx> wrote:

> The posting of data to the table in question is extremely
> slow...yesterday I saw that it took over 6 min to post just 124
> rows of data. That is just not acceptable. Additionally, we have
> about 9,000 to 11,000 products that come in daily (some contain
> one row of data...others may be several hundred lines) and are
> posted to the database. The system used to run with maybe 10-100
> products in the queue (that is before I posted over 200 million
> rows). Yesterday, there were over 25,000 products in the queue
> waiting to be processed - which is bad for our operational use.

Sounds like bloat.

> I think the table got even more bloated when I tried to do the
> vacuum full last week and had to kill the process - it never
> completed.

VACUUM FULL can increase bloat if it doesn't complete.  Even when it
completes it tends to bloat indexes.

> From the info I received from previous posts, I am going to change
> my game plan. If anyone has thoughts as to different process or
> can confirm that I am on the right track, I would appreciate your
> input.
> 
> 1. I am going to run a CLUSTER on the table instead of a VACUUM
> FULL. 

If you have room for a second copy of your data, that is almost
always much faster, and less prone to problems.

> But I have a few questions that are not clear from the info I have
> found on the internet regarding this process.  The table name is
> 'pecrsep', and the database is hd_ob92rha. It has multiple columns
> in the table, but I am choosing the 'obsdate' as my index (I
> believe 'index' refers to the name of a column in the table?).

No, it refers to the name of an index.  For example, you might have
an index to support your primary key declaration named
"pecrsep_pkey".  If you are using psql, type

\d pecrsep

When I do that for our "Party" table in our statewide copy of the
data, I see these lines near the bottom:

Indexes:
    "Party_pkey" PRIMARY KEY, btree ("countyNo", "caseNo",
"partyNo") CLUSTER
    "Party_EAccountNo" btree ("countyNo", "eAccountNo")
    "Party_SearchName" btree ("searchName", "countyNo")

So I could choose to cluster this table using "Party_pkey",
"Party_EAccountNo", or "Party_SearchName".  The primary key has been
set as the default if I don't specify an index.

> Anyway, if I interpret the info correctly the first time a CLUSTER
> is run, I have to run something like:
> 
> CLUSTER obsdate ON pecrsep (CLUSTER /indexname/ ON /tablename/)

For 8.2 that is correct.  (In more recent versions that syntax is
deprecated in favor of a new wording which is more natural for most
people.)

> Is it true that the first time I run a CLUSTER, I must provide an 
> 'indexname' or can I just run: CLUSTER pecrsep (CLUSTER
> tablename)? Or is it better to provide an indexname?

If an index name has been set for a default on that table, and
that's the index you want to use, it really doesn't matter.

> 2. I know there are a lot of variables such as table size, etc.
> Are we looking at something that will take couple hours...or 24+
> hours?

Before I gave up on VACUUM FULL as an overall bad idea for us, I had
left a VACUUM FULL on a large table (which had some bloat) running
when I left work Friday, and found it still running Monday morning. 
I canceled the VACUUM FULL, noticed it was now *further* bloated,
and used CLUSTER.  In our case, with that table, it finished in a
few hours.

> Is there a way to monitor the CLUSTER process to ensure its
> working?

I guess you could look at the generated files, but I've never heard
of it failing, short of running the data area out of space.

> Is there any way to get output to give an estimate as to how much
> of the CLUSTER has completed...and how much is left to run?

If you calculated an estimate of the non-bloated heap space you
could watch the files being created for the new copy as it runs. 
Remember that all the indexes also need to be built after the heap
is populated.

> 3. With the info from previous posts, I am certainly rethinking
> the use of VACUUM FULL.

Good.

> After I run the CLUSTER, should I REINDEX the table, or is that 
> redundant to the CLUSTER?

That would be redundant.  At completion of the CLUSTER, the table
has brand new, freshly built indexes.

> Any idea as to how long this may take (i.e 1-2 hours or 24+
> hours)?

If you know the speed of your drives, you can set a lower bound on
it, I guess.  *At least* as much time as it takes to read the entire
table through the specified index, plus the time to sequentially
write the un-bloated heap, plus the time to build all indexes on the
table.  Caching effects can make this tricky to estimate.

> 4. If either/both the CLUSTER or REINDEX are successful, that
> would indicate that I don't need to run a VACUUM FULL...correct?

REINDEX would not eliminate heap bloat.

> 5. Then I would run a VACUUM ANALYZE as suggested.

Be sure to run autovacuum.  You probably want to run it with
settings more aggressive than the 8.2 defaults.  We generally
supplement that with a nightly or weekly VACUUM ANALYZE VERBOSE. 
That was especially important prior to 8.4, as you would get
information at the end of the listing to help determine when you
should adjust the free space manager (fsm) allocations.  (In 8.4 and
later such allocations are automatic, so you no longer need to worry
about setting these high enough to prevent ongoing performance
degeneration.)

> After I do all this on the single table, should I repeat something
> similar for the whole database? Or should I just attack the
> largest tables?

All tables which are being modified should be regularly vacuumed and
analyzed.  CLUSTER is only indicated for correcting bloat and
certain particular usage patterns (where there is an index on which
a number of consecutive rows are frequently accessed).

> One last question, can anyone recommend a good postgres book for
> me to purchase?

Yes.  "PostgreSQL 9.0 High Performance" and "PostgreSQL 9
Administration Cookbook":

http://www.postgresql.org/docs/books/

> I guess I should repeat that we are still running version 
> 8.2.6 of postgres on Linux machines.

It would still be a good idea to update to 8.2.recent (which doesn't
require a conversion or introduce changes in behavior beyond bug and
security fixes).  Once you get things stabilized you should probably
get 9.0 or 9.1 on your radar.  (At this point I wouldn't consider
just going to 8.3 or 8.4 anymore.)  Much better performance, many
cool new features,, and keeping within the support window all make
it worthwhile.

> Is there an appropriate book for this version available?

The books I mentioned above do cover older versions.

-Kevin

-- 
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin