Context switch storms

"Omar Kilani" <omar.kilani@xxxxxxxxx> · Tue, 2 Dec 2008 19:51:06 +1100

Hi there,

We've recently started seeing context switch storms on our
primary Postgres database, and I was wondering if anyone has
encountered something similar or has any ideas as to what could be
causing them.

The machine configuration is:

8xIntel Xeon Harpertown 5430 (2.66GHz)
32GB of RAM
10xRAID10 (15k RPM, SAS) for Postgres
RHEL 5.2 (2.6.18-92.1.10.el5) + deadline scheduler

We're running Postgres 8.3.4.

Some postgresql.conf settings:

shared_buffers = 10922MB
effective_cache_size = 19200MB
default_statistics_target = 100

The database is about 150GB in size (according to pg_database_size).

Our workload is probably something like 98% reads / 2% writes. Most of
our queries are fast, short-lived SELECTs across 5 tables.

Some performance numbers during "normal" times and when the CS storm
is in progress:

                              Normal load      During CS storm
Req/s                         3326 - 3630      2362 - 3054
Average query runtime (us)    4729 - 7415      20731 - 103387
count(*) from pg_locks        10 - 200         1000 - 1400
Context switches/s            12k - 21k        38k - 55k
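
For anyone wanting to collect comparable context switch figures: the numbers above would normally come from something like the cs column of vmstat. A minimal Python sketch of the same measurement follows, assuming a Linux host where /proc/stat exposes a cumulative "ctxt" counter. The function names and the 38k threshold (the lower bound of the storm range quoted above) are illustrative assumptions, not part of our actual monitoring.

```python
import time

# Assumption: threshold taken from the lower bound of the storm range above.
STORM_THRESHOLD = 38_000


def parse_ctxt(proc_stat_text: str) -> int:
    """Return the cumulative context switch count from /proc/stat contents."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no 'ctxt' line found")


def switches_per_second(prev: int, cur: int, elapsed_s: float) -> float:
    """Convert two cumulative counter readings into a per-second rate."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (cur - prev) / elapsed_s


def is_storm(rate: float, threshold: float = STORM_THRESHOLD) -> bool:
    """Flag a sample whose rate reaches the (assumed) storm threshold."""
    return rate >= threshold


def sample(interval_s: float = 1.0, path: str = "/proc/stat") -> float:
    """Read the counter twice, interval_s apart, and return switches/s."""
    with open(path) as f:
        before = parse_ctxt(f.read())
    start = time.monotonic()
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_ctxt(f.read())
    return switches_per_second(before, after, time.monotonic() - start)
```

Calling sample() once a second and logging the result next to a pg_locks count would give a timeline in which the start of a storm can be lined up against other activity.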

During the CS storm period, 99% of the entries in pg_locks have
granted = t and mode = AccessShareLock, spread across the 5 most
commonly read tables (and their indexes). In one sample of pg_locks,
there was only one RowExclusiveLock, on a less frequently read table.
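
A breakdown like the one above can be obtained by grouping pg_locks on its mode and granted columns. The sketch below shows one way to do that, with the query as a string and a small helper that turns the result rows into fractions. The helper name and the row counts in the usage note are hypothetical, chosen only to mirror the 99% figure, and are not real samples from our database.

```python
from collections import Counter

# Query against the pg_locks system view; mode and granted are real columns.
LOCK_SUMMARY_SQL = """
SELECT mode, granted, count(*) AS n
FROM pg_locks
GROUP BY mode, granted
ORDER BY n DESC;
"""


def lock_shares(rows):
    """Map (mode, granted) to its fraction of all locks.

    rows is an iterable of (mode, granted, n) tuples, i.e. the shape
    returned by LOCK_SUMMARY_SQL through any DB-API cursor.
    """
    totals = Counter()
    for mode, granted, n in rows:
        totals[(mode, granted)] += n
    total = sum(totals.values())
    if total == 0:
        return {}
    return {key: count / total for key, count in totals.items()}
```

With made-up rows such as [("AccessShareLock", True, 990), ("RowExclusiveLock", True, 10)], lock_shares reports AccessShareLock with granted = t as 0.99 of all locks. Any ungranted entries would show up as separate keys with granted = False, which makes waiters easy to spot.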

We used to use a 16-way Intel Xeon Tigerton in this machine, but
Postgres would basically become unresponsive under 120k - 200k context
switches/s, so we switched to the 8-way Harpertown. The problems
disappeared, but they've now come back. :)

I asked Gavin S and Neil C about this issue before mailing the list --
Gavin said this was a known issue which was hard to reproduce, and
Neil said that most (all known?) context switch issues were fixed in
8.2+.

We can pretty much reproduce this consistently, though it doesn't
happen *all* the time (maybe 2-4 hours every week).

Postgres is the only thing running on the machine -- and at the time
of these CS spikes autovacuum was not running and no checkpoint was
in progress.

Please let me know if any further information is needed, or if there's
anything I can do to try and gain more insight into the cause of these
CS storms.

Thanks!

Regards,
Omar

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)