Re: significant vacuum issues - looking for suggestions

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



In response to Kevin Kempter <kevin@xxxxxxxxxxxxxxxxxxx>:

> Hi List;
> 
> I've just started working with a client that has been running Postgres (with 
> no DBA) for a few years. They're running  version 8.1.4 on 4-way dell boxes 
> with 4Gig of memory on each box attached to RAID-10 disk arrays. 
> 
> Some of their key config settings are here:
> shared_buffers = 20480
> work_mem = 16384
> maintenance_work_mem = 32758

Before you do any of those other things, bump shared_buffers to about
120000 and maintenance_work_mem to 250000 or so -- unless this box
has other applications on it using significant amounts of those 4G of
RAM.  You may find that these changes alone are enough to get vacuum
to complete.  You'll need to restart the server for the shared_buffers
setting to take effect.

Can you do a pg_relation_size() on the tables in question?

> wal_buffers = 24
> checkpoint_segments = 32
> checkpoint_timeout = 300
> checkpoint_warning = 30
> effective_cache_size = 524288
> autovacuum = on
> autovacuum_naptime = 60
> autovacuum_vacuum_threshold = 500
> autovacuum_analyze_threshold = 250
> autovacuum_vacuum_scale_factor = 0.2
> autovacuum_analyze_scale_factor = 0.1
> autovacuum_vacuum_cost_delay = -1
> autovacuum_vacuum_cost_limit = -1
> 
> 
> Currently I've found that they have approx 17 tables that show a significant 
> amount of bloat in the system. The worst one showing over 5million  pages 
> worth of dead space. One of the problems is that their system is so busy with 
> activity during the day and massive data load processes at night that they 
> often kill the pid of vacuum processes kicked off by autovacuum because the 
> overall load impact disallows users from logging into the app since the login 
> process includes at least one db query that then seems to hang because there 
> are anywhere from 100 - 300 queries ahead of it at any given time. Normally a 
> user gets logged in with an avg wait of 5 - 10 seconds but when a long 
> running vacuum (sometimes due to a long running update process that's trying 
> to sort/update > 40million rows)  is going the system gets to a state where 
> the login queries never get executed until the vacuum process is killed.
> 
> As a result of this I believe that the biggest table (the one with > 5million 
> pages worth of dead space) has never been vacuumed to completion. I suspect 
> this is the case for a few of the other top dead space tables as well but I 
> can't be sure. 
> 
> My first priority was to get this vacuum scenario cleaned up. First off I 
> added the biggest table into pg_autovacuum and set the enabled column to 
> false ('f'). Then I set vacuum_cost_delay to 10 and in the same session 
> ran "vacuum analyze verbose big_table".  This ran for 7.5 hours before we had 
> to kill it due to system load - and to make matters worse the high system 
> load was forcing many of the nightly batch queries that load, update, etc the 
> data to stack up to a point where the system was at less than 2% idle (CPU) 
> for the next 4 hours and barely responding to the command line.
> 
> To make matters worse I find out this morning that the db is at 85% per used 
> transaction ID's - again since a vacuum on the entire db has never been 
> completed. 
> 
> As far as I can tell, the overall db size is currently 199G of which approx 
> 104G seems to be valid data.
> 
> Here's my thoughts per how to proceed:
> 
> =====================================
> 1) fix the big table ASAP (probably over the weekend) since it's not only the 
> biggest table but the most active like this:
> 
>    a) run a pg_dump of this table
> 
>    b)  restore this dump into a new table (i.e. new_big_table)
> 
>    c) lock the original big_table, sync any changes, inserts, deletes since we 
> did the dump from big_table into new_big_table
> 
>    d) drop big_table
> 
>    e) re-name new_big_table to big_table
> 
> * I may run through this for a few of the other large, highly active tables 
> that have minimal page density as well.
> =====================================
> 
> 
> The development folks that have been here awhile tell me that it seems like 
> when they have a query (not limited to vacuum processes) that has been 
> running for a long time (i.e. > 5 or 6 hours) that the query sort of "goes 
> crazy" and the entire system gets pegged until they kill that process. - I've 
> not heard of this but I suspect upgrading to 8.2.4 is probably a good plan at 
> this point as well, so for step 2, I'll do this:
> 
> =====================================
> 2) (obviously I'll do this in dev first, then in QA and finally in prod)
>    a) install verson 8.2.4 from source, leaving 8.1.4 in place
> 
>    b) create the new 8.2.4 cluster on a new port
> 
>    c) setup WAL archiving on the 8.1.4 cluster
> 
>    d) do a full dump of the 8.1.4 cluster and restore it to the new 8.2.4 
> cluster
> 
>    e) stop the 8.2.4 cluster and bring it up in recovery mode, pointing it to 
> the directory where we're archiving the 8.1.4 cluster's WAL segments.
> 
>    f) once caught up, bring both clusters down
> 
>    g) copy any final files from the 8.1.4 cluster's pg_xlog directory into the 
> new 8.2.4 pg_xlog dir (is this ok, since I'm moving 8.1.4 version tx logs 
> into an 8.2.4 xlog dir?)
> 
>    h) Change the port on the 8.2.4 cluster to what the original 8.1.4 cluster 
> port was
> 
>    i) bring up the new 8.2.4 system, and actively manage the vacuum needs 
> moving fwd via a combination of autovacuum, cron processes for specififed 
> table vac's (daily, hourly, 15min, 5min, etc), and as needed interactive 
> session vacuums
> =====================================
> 
> 
> The src based install will allow me to setup a robust upgrade CM process 
> capable of supporting multiple concurrent versions on a server if needed, the 
> ability to quickly revert to a previous version, etc however this is a 
> discussion for another day - I only mention it in case the question "why not 
> just use RPM's?" arises...
> 
> 
> So here's my questions:
> 
> 1) Does this sound like a good plan?
> 
> 2) Are there other steps I should be taking, other Issues I should be 
> concerned about short-term, etc? 
> 
> 3) Does anyone have any additional advice for managing either this initial 
> mess, or the system(s) long term?
> 
> Thanks in advance...
> 
> /Kevin
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
> 
>                http://www.postgresql.org/docs/faq
> 
> 
> 
> 
> 
> 


-- 
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@xxxxxxxxxxxxxxxxxxxxxxx
Phone: 412-422-3463x4023

****************************************************************
IMPORTANT: This message contains confidential information and is
intended only for the individual named. If the reader of this
message is not an intended recipient (or the individual
responsible for the delivery of this message to an intended
recipient), please be advised that any re-use, dissemination,
distribution or copying of this message is prohibited. Please
notify the sender immediately by e-mail if you have received
this e-mail by mistake and delete this e-mail from your system.
E-mail transmission cannot be guaranteed to be secure or
error-free as information could be intercepted, corrupted, lost,
destroyed, arrive late or incomplete, or contain viruses. The
sender therefore does not accept liability for any errors or
omissions in the contents of this message, which arise as a
result of e-mail transmission.
****************************************************************

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not
       match

[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux