Re: CPU spikes and transactions

On Tue, Oct 15, 2013 at 08:59:08AM -0700, Tony Kay wrote:
> On Mon, Oct 14, 2013 at 4:42 PM, Tomas Vondra <tv@xxxxxxxx> wrote:
> 
> > On 15.10.2013 01:00, Tony Kay wrote:
> > > Hi,
> > >
> > > I'm running 9.1.6 w/22GB shared buffers, and 32GB overall RAM on a
> > > 16 Opteron 6276 CPU box. We limit connections to roughly 120, but
> > > our webapp is configured to allocate a thread-local connection, so
> > > those connections are rarely doing anything more than half the time.
> >
> > Lower your shared buffers to about 20% of your RAM, unless you've tested
> > it's actually helping in your particular case. It's unlikely you'll get
> > better performance by using more than that, especially on older
> > versions, so it's wiser to leave the rest for page cache.
> >
> > It might even be one of the causes of the performance issue you're
> > seeing, as shared buffers are not exactly overhead-free.
> >
> > See this for more details on tuning:
> >
> >    http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
> 
> 
> I had followed the general directions from several sources years ago, which
> indicate up to 40% of RAM. We've been running very large shared buffers for

In general it's best to start with 10-15% of the RAM, and no more than
2-4 GB.

> 4 years now, but it is difficult to generate a good real load without
> testing against users, so we have not felt the need to move it around. In
> general, I don't tend to tinker with a setting that has been fine for this
> long without good reason. I've been wanting to upgrade to the newer
> mmap-based versions of pgsql, but was waiting to re-tune this when I did so.
> 
> Why do you suspect that shared_buffers would cause the behavior I'm seeing?
> 

for two reasons:

- some of the overhead of bgwriter and checkpoints is more or less linear
  in the size of shared_buffers; for example, a large quantity of data may
  be dirty when a checkpoint occurs, so there is much more to write out at
  once (see the sketch below for a cheap way to check this).

- the OS page cache is also used for reads and writes, so the larger
  shared_buffers is, the more you risk double buffering (the same blocks
  held both in the OS cache and in the database buffer cache).
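
To check the first point without touching the production queries, here is a
minimal sketch of the knobs involved (the 4GB value is only an illustration
of the "few GB" range above, not something tested against your workload):

    # postgresql.conf
    shared_buffers = 4GB                 # changing this requires a restart
    log_checkpoints = on                 # reload is enough; logs each checkpoint
    checkpoint_completion_target = 0.9   # reload is enough; spreads checkpoint writes

With log_checkpoints on, you can correlate the "checkpoint complete: wrote
N buffers ..." log lines with the times of the CPU spikes, which should show
fairly quickly whether checkpoints are involved at all.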

> 
> >
> >
> > You're on a rather old 9.1.x version, BTW. The last version in this
> > branch is 9.1.10 and there are some important security fixes (e.g. in
> > 9.1.9). Not sure if there are any fixes relevant to the performance
> > issue, though.
> >
> An upgrade to 9.1.10 is planned.
> 
> 
> > A few initial questions:
> >
> > * What OS are we dealing with?
> >
> 
> CentOS el6
> 
> 
> >
> > * So how many active connections are there on average (see
> >   pg_stat_activity for connections running queries)?
> >
> 
> about 40-60
> 
> 
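
For reference, on 9.1 you can count the backends actually running something
with a query along these lines (just a sketch; '<IDLE>' is what 9.1's
pg_stat_activity reports for idle backends):

    select count(*)
      from pg_stat_activity
     where current_query <> '<IDLE>';
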
> >
> > * How much data are we talking about? In total and in the imports?
> >
> 
> 80GB database. The imports are maybe 1-3 MB...often much smaller. 10k rows
> would probably be about average.
> 
> 
> >
> > > We have been running smoothly for over a year on this configuration,
> > > and recently started having huge CPU spikes that bring the system to
> > > its knees. Given that it is a multiuser system, it has been quite
> > > hard to pinpoint the exact cause, but I think we've narrowed it down
> > > to two data import jobs that were running in semi-long transactions
> > > (clusters of row inserts).
> > >
> > > The tables affected by these inserts are used in common queries.
> > >
> > > The imports will bring in a row count of perhaps 10k on average
> > > covering 4 tables.
> > >
> > > The insert transactions are at isolation level read committed (the
> > > default for the JDBC driver).
> > >
> > > When the import would run (again, theory...we have not been able to
> > > reproduce), we would end up maxed out on CPU, with a load average of
> > > 50 for 16 CPUs (our normal busy usage is a load average of 5 out of
> > > 16 CPUs).
> > >
> > > When looking at the active queries, most of them are against the
> > > tables that are affected by these imports.
> >
> > Which processes consume most CPU time? Are those backends executing the
> > queries, or some background processes (checkpointer, autovacuum, ...)?
> >
> >
> The backends executing the queries...most of the queries that seem hung
> usually run in a few ms.
> 
> 
> > Can you post a "top -c" output collected at the time of the CPU peak?
> >
> >
> Don't have process accounting, so I cannot regenerate that; however, I can
> tell you what queries were active during one of the spikes.
> 
> There were 36 queries against the table ind_event (which is one affected
> by the import). Those queries usually take 5-10ms, and we never see more
> than 2 active during normal operation. These had been active for
> _minutes_...a sample of the running queries:
> 
>  time_active     |       datname       | procpid |   query
> -----------------+---------------------+---------+-------------------------------------------
>  00:08:10.891105 | tudb                |    9058 | select * from mr_uss_ind_event_x where (tu
>  00:08:10.981845 | tudb                |    8977 | select * from mr_uss_ind_event_x where (tu
>  00:07:08.883347 | tudb                |    8930 | select * from mr_uss_ind_event_x where org
>  00:07:15.266393 | tudb                |    8927 | select * from mr_uss_ind_event_x where org
>  00:07:27.587133 | tudb                |   11867 | update msg_result set dt_result=$1,msg_id=
>  00:08:06.458885 | tudb                |    8912 | select * from mr_uss_ind_event_x where org
>  00:06:43.036487 | tudb                |    8887 | select * from mr_uss_ind_event_x where (tu
>  00:07:01.992367 | tudb                |    8831 | select * from mr_uss_ind_event_x where (tu
>  00:06:59.217721 | tudb                |    8816 | select * from mr_uss_ind_event_x where org
>  00:07:07.558848 | tudb                |    8811 | update md_invoice set unbilled_amt=unbille
>  00:07:30.636192 | tudb                |    8055 | select * from mr_uss_ind_event_x where (tu
>  00:07:26.761053 | tudb                |    8053 | update msg_result set dt_result=$1,msg_id=
>  00:06:46.021084 | tudb                |    8793 | select * from mr_uss_ind_event_x where (tu
>  00:07:26.412781 | tudb                |    8041 | select * from mr_uss_ind_event_x where org
>  00:07:43.315019 | tudb                |    8031 | select * from mr_uss_ind_event_x where org
>  00:07:42.651143 | tudb                |    7990 | select * from mr_uss_ind_event_x where org
>  00:06:45.258232 | tudb                |    7973 | select * from mr_uss_ind_event_x where (tu
>  00:07:46.135027 | tudb                |    7961 | select * from mr_uss_ind_event_x where (tu
>  00:07:31.814513 | tudb                |    7959 | select * from mr_uss_ind_event_x where (tu
>  00:07:27.739959 | tudb                |    8221 | select * from mr_uss_ind_event_x where org
>  00:07:21.554369 | tudb                |    8191 | select * from mr_uss_ind_event_x where org
>  00:07:30.665133 | tudb                |    7953 | select * from mr_uss_ind_event_x where org
>  00:07:17.727028 | tudb                |    7950 | select * from mr_uss_ind_event_x where org
>  00:07:25.657611 | tudb                |    7948 | select * from mr_uss_ind_event_x where org
>  00:07:28.118856 | tudb                |    7939 | select * from mr_uss_ind_event_x where org
>  00:07:32.436705 | tudb                |    7874 | insert into mr_uss_ind_event (prelimtime_c
>  00:08:12.090187 | tudb                |    7873 | select * from mr_uss_ind_event_x where (tu
>  00:07:19.181981 | tudb                |    7914 | select * from mr_uss_ind_event_x where (tu
>  00:07:04.234119 | tudb                |    7909 | select * from mr_uss_ind_event_x where (tu
>  00:06:52.614609 | tudb                |    7856 | select * from mr_uss_ind_event_x where org
>  00:07:18.667903 | tudb                |    7908 | select * from mr_uss_ind_event_x where (tu
> 
> The insert listed there is coming from that import...the others are
> querying a view that includes that table in a join.
> 
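
For what it's worth, a 9.1-compatible way to capture that same kind of
snapshot, in case anyone wants to reproduce it (just a sketch using the
pre-9.2 column names):

    select now() - query_start as time_active,
           datname, procpid, current_query
      from pg_stat_activity
     where current_query <> '<IDLE>'
     order by time_active desc;
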
> > Also, try to collect a few snapshots of the pg_stat_bgwriter catalog
> > before and during the loads. Don't forget to include the timestamp:
> >
> >    select now(), * from pg_stat_bgwriter;
> >
> >
> This is a live production system, and it will take me some doing to
> generate a load on a test server that triggers the condition. I'll be
> certain to gather this and the other stats if I can trigger it.
> 
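
One low-impact way to collect those snapshots on the production box, if the
condition recurs (the table name here is just an example):

    -- run once
    create table bgwriter_snaps as
      select now() as ts, * from pg_stat_bgwriter;

    -- then run periodically, e.g. every minute from cron
    insert into bgwriter_snaps
      select now(), * from pg_stat_bgwriter;

The same approach works for the pg_stat_database snapshots mentioned below.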
> 
> > and when you're at it, pg_stat_database snapshots might be handy too
> > (not sure if you're running a single database or multiple ones), so use
> > either
> >
> >   select now(), * from pg_stat_database;
> >
> > or
> >
> >   select now(), * from pg_stat_database where datname = '..dbname..';
> >
> > That should give us at least some insight into what's happening.
> >
> > > Our workaround (that is holding at present) was to drop the
> > > transactions on those imports (which is not optimal, but fortunately
> > > is acceptable for this particular data). This workaround has
> > > prevented any further incidents, but is of course inconclusive.
> > >
> > > Does this sound familiar to anyone, and if so, please advise.
> >
> > I'm wondering how this could be related to the transactions, and IIRC
> > the stats (e.g. # of inserted rows) are sent at commit time. That might
> > trigger the autovacuum. But without the transactions the autovacuum
> > would be triggered sooner ...
> >
> > regards
> > Tomas
> >
> >
> > --
> > Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-performance
> >

-- 
No trees were killed in the creation of this message.
However, many electrons were terribly inconvenienced.


-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance



