Hi Calvin,
Yes, I have sar data on all systems going back for years.
Since others will probably want to be assured that I am really "reading the data" right:
- This is 92% user CPU time, 5% sys, and 1% soft (softirq)
- On some of the problem periods I _do_ see a short spike of pswpout/s (pages swapped out, i.e. memory pressure), but again, not enough to end up using much system time
- The database disks are idle (all of the data being used is in RAM), they are SSDs, and their average service times are barely measurable in ms
If I had to guess, I'd say it is spinlock misbehavior; I cannot understand how else a transaction blocking other things would drive the CPUs so hard into the ground with user time.
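In case it helps, here is roughly how I sample activity during one of these spikes -- a minimal sketch in Python/psycopg2 (the connection settings are placeholders, and procpid/current_query/waiting are the 9.1-era column names in pg_stat_activity):

    import time
    import psycopg2

    # Placeholder connection settings -- adjust for your environment.
    conn = psycopg2.connect("dbname=mydb user=postgres host=localhost")
    conn.autocommit = True

    for _ in range(10):  # ten one-second samples
        cur = conn.cursor()
        cur.execute("""
            SELECT waiting, count(*)
            FROM pg_stat_activity
            WHERE current_query NOT IN ('<IDLE>', '<IDLE> in transaction')
              AND procpid <> pg_backend_pid()
            GROUP BY waiting
        """)
        # Many active backends with waiting = false points at CPU contention
        # (e.g. spinlocks) rather than ordinary heavyweight-lock waits.
        print(time.strftime("%H:%M:%S"), cur.fetchall())
        cur.close()
        time.sleep(1)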
Tony
On Mon, Oct 14, 2013 at 4:05 PM, Calvin Dodge <caldodge@xxxxxxxxx> wrote:
Have you tried running "vmstat 1" during these times? If so, what is
the percentage of WAIT time? Given that IIRC shared buffers should be
no more than 25% of installed memory, I wonder if too little is
available for system caching of disk reads. A high WAIT percentage
would indicate excessive I/O (especially random seeks).
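Something along these lines would give you the average wait figure -- a rough Python sketch (the vmstat invocation and sample count are just an example):

    import subprocess

    # Take 10 one-second vmstat samples and average the I/O wait column.
    out = subprocess.check_output(["vmstat", "1", "10"]).decode()
    lines = out.splitlines()
    fields = lines[1].split()          # field-name header: r b swpd ... id wa
    wa_idx = fields.index("wa")
    # lines[2] is the since-boot average, so skip it and keep the live samples.
    samples = [int(l.split()[wa_idx]) for l in lines[3:] if l.strip()]
    print("average %%wa over %d samples: %.1f"
          % (len(samples), sum(samples) / len(samples)))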
Calvin Dodge
On Mon, Oct 14, 2013 at 6:00 PM, Tony Kay <tony@xxxxxxxxxxxxx> wrote:
> Hi,
>
> I'm running 9.1.6 with 22GB shared_buffers and 32GB of RAM overall on a
> 16-core Opteron 6276 box. We limit connections to roughly 120, but our webapp
> is configured to allocate a thread-local connection, so those connections are
> idle more than half the time.
>
> We have been running smoothly for over a year on this configuration, and
> recently started having huge CPU spikes that bring the system to its knees.
> Given that it is a multiuser system, it has been quite hard to pinpoint the
> exact cause, but I think we've narrowed it down to two data import jobs that
> were running in semi-long transactions (clusters of row inserts).
>
> The tables affected by these inserts are used in common queries.
>
> The imports bring in perhaps 10k rows on average, spread across 4 tables.
>
> The insert transactions are at isolation level read committed (the default
> for the JDBC driver).
>
> When the import would run (again, this is theory; we have not been able to
> reproduce it), we would end up maxed out on CPU, with a load average of 50 on
> 16 CPUs (our normal busy usage is a load average of about 5 on those 16 CPUs).
>
> When looking at the active queries, most of them are against the tables that
> are affected by these imports.
>
> Our workaround (which is holding at present) was to drop the enclosing
> transactions on those imports, so each insert commits on its own (not optimal,
> but fortunately acceptable for this particular data). This workaround has
> prevented any further incidents, but it is of course inconclusive.
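> For illustration only (table and column names here are made up, and this is
> sketched in Python/psycopg2 rather than the JDBC code we actually use), the
> two shapes are roughly:
>
>     import psycopg2
>
>     conn = psycopg2.connect("dbname=mydb user=import_user")  # placeholder DSN
>
>     # Original shape: one read-committed transaction wrapping the whole
>     # cluster of inserts; all rows become visible at the final commit.
>     def import_in_one_txn(rows):
>         cur = conn.cursor()
>         for a, b in rows:
>             cur.execute("INSERT INTO import_target (a, b) VALUES (%s, %s)", (a, b))
>         conn.commit()
>
>     # Workaround shape: autocommit, so each insert is its own tiny
>     # transaction and commits immediately.
>     def import_autocommit(rows):
>         conn.autocommit = True
>         cur = conn.cursor()
>         for a, b in rows:
>             cur.execute("INSERT INTO import_target (a, b) VALUES (%s, %s)", (a, b))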
>
> Does this sound familiar to anyone? If so, please advise.
>
> Thanks in advance,
>
> Tony Kay
>