Re: Re: significant performance hit whenever autovacuum runs after upgrading from 9.0 -> 9.1

Lonni J Friedman <netllama@xxxxxxxxx> · Thu, 24 May 2012 12:28:14 -0700

On Wed, May 23, 2012 at 2:45 PM, Gavin Flower
<GavinFlower@xxxxxxxxxxxxxxxxx> wrote:
> On 24/05/12 08:18, Lonni J Friedman wrote:
>
> On Wed, May 23, 2012 at 12:36 PM, Gavin Flower
> <GavinFlower@xxxxxxxxxxxxxxxxx> wrote:
>
> On 24/05/12 05:09, Lonni J Friedman wrote:
>
> On Wed, May 23, 2012 at 9:37 AM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>
> Lonni J Friedman <netllama@xxxxxxxxx> writes:
>
> After banging my head on the wall for  a long time, I happened to
> notice that khugepaged was consuming 100% CPU every time autovacuum
> was running.  I did:
> echo "madvise" > /sys/kernel/mm/transparent_hugepage/defrag
> and immediately the entire problem went away.
>
> Fascinating.
>
> In hindsight, sure.  Before that, it was 2 days of horror.
>
> So this looks like a nasty Fedora16 kernel bug to me, or maybe
> postgresql & Fedora16's default kernel settings are just not
> compatible?
>
> I agree, kernel bug.  What kernel version are you using exactly?
>
> I'm using the stock 3.3.5-2.fc16.x86_64 kernel that is in Fedora updates.
>
> Is anyone else using Fedora16 & PostgreSQL-9.1 ?
>
> I use an F16 box daily, but can't claim to have done major performance
> testing with it.  Can you put together a summary of your nondefault
> Postgres settings?  I wonder whether it only kicks in for a certain
> size of shared memory for instance.
>
> Oh yea, I'm quite certain that this is somehow related to my setup,
> and not a generic problem with all F16/pgsql systems.  For starters,
> this problem isn't happening on any of the 3 standby systems, which
> are all otherwise identical to the master in every respect.  Also when
> we had done some testing (prior to the upgrades), we never ran into
> any of these problems.  However our test environment was on smaller
> scale hardware, with a much smaller number of clients (and overall
> load).
>
> Here are the non default settings in postgresql.conf :
> wal_level = hot_standby
> archive_mode = on
> archive_timeout = 61
> max_wal_senders = 10
> wal_keep_segments = 5000
> hot_standby = on
> log_autovacuum_min_duration = 2500
> autovacuum_max_workers = 4
> maintenance_work_mem = 1GB
> checkpoint_completion_target = 0.7
> effective_cache_size = 88GB
> work_mem = 576MB
> wal_buffers = 16MB
> checkpoint_segments = 64
> shared_buffers = 8GB
> max_connections = 350
>
> Let me know if you have any other questions.  I'd be happy to provide
> as much information as possible if it can aid in fixing this bug.
>
> I think they will need details of things like: RAM, number/type processors,
> number & type
> of disks, disk controllers & any other hardware specs that might be relevant
> etc.- at very
> least: total RAM & number of spindles
>
> 16 core Xeon X5550 2.67GHz
> 128GB RAM
> $PGDATA sits on a RAID5 array comprised of 3 SATA disks.  Its Linux's
> md software RAID.
>
> How does this compare to your other machines running the same, or similar,
> databases?
> However, you do say that the other machines are indentical - but are the
> other
> machines different in any aspect, that might prove siginificant?
>
>
>
> Also anything else running on the box.
>
> nothing else.  its dedicated exclusively to postgresql.
>
> Plus transaction load pattern - over time and read/write ratios.
>
> I'm not sure how I'd obtain this data.  however, the patterns didn't
> change since the upgrade.  If someone can point me in the right
> direction, I can at least obtain this data as its generated currently.
>
> type/nature of queries
>
> I need some clarification on specifically what you're asking for here.
>
> The complexity, structure, and features of the queries. Do you have lots of
> sub queries,
> and ORDER BY's? Also the number of tables accessed in a query. This is
> heading into the
> territory where others will be better placed to advise you as to what might
> be relevant!

No, not lots of subqueries or ORDERing, and most queries only touch a
single table.  However, I'm honestly not sure that I'm following where
you're going with this.   The problem isn't triggered by explicit
queries.  I can disable all external access, and simply wait for
autovacuum to kick off, and the box starts to die.

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general