SSI slows down over time

Hi all,

(Referred here from pgsql-performance)

tl;dr: every time I shut the database down and bring it back up, SSI seems to get slower. To avoid thousands of SSI aborts due to running out of shared memory, I've had to set max_pred_locks_per_transaction to several thousand (2000 is tolerable, 8000 is needed to avoid all errors); this seems excessively high given how short TPC-C transactions are and how aggressively SSI reclaims storage. I've also found what appears to be a bug: the SSI SLRU storage (pg_serial) sometimes jumps from ~200kB to 8GB within a matter of seconds. The 8GB persists through later runs and seems to be part of the performance problem; deleting the contents of pg_serial after each database shutdown resolves most of it.
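
(For context, I've been spot-checking the live SIREAD lock count mid-run with something like the line below; the database name is whatever the benchmark created, 'tpcc' here is just a placeholder:)

# rough check on how many SIREAD (predicate) locks are live during a run
psql -d tpcc -c "SELECT count(*) FROM pg_locks WHERE mode = 'SIReadLock';"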

Excruciatingly long and detailed version follows...

This is with pgsql-9.3.4, x86_64-linux, home-built with `./configure --prefix=...' and gcc-4.7.
24-core Intel box with hyperthreading (so 48 contexts).
TPC-C courtesy of oltpbenchmark.com: 12 warehouses, 24 clients.

I get a strange behavior across repeated runs: each 100-second run is a bit slower than the one preceding it, when run with SSI (SERIALIZABLE). Switching to SI (REPEATABLE_READ) removes the problem, so it's apparently not due to the database growing. The database is completely shut down (pg_ctl stop) between runs, but the data lives in tmpfs, so there's no I/O problem here. 64GB RAM, so no paging, either.
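
For concreteness, each run is driven by roughly the following (paths are placeholders; $PGDATA sits on the tmpfs mount):

# per-run sequence; the database is fully shut down between runs
pg_ctl -D $PGDATA -l logfile start
# ... run the oltpbench TPC-C client for 100 seconds (12WH, 24 terminals) ...
pg_ctl -D $PGDATA stop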

Note that this slowdown is in addition to the 30% performance gap between SI and SSI on my 24-core machine. I understand that the latter is a known bottleneck [1]; my question is why the bottleneck should get worse over time:

With SI, I get ~4.4ktps, consistently.
With SSI, I get 3.9, 3.8, 3.4, 3.3, 3.1, 2.9, ... ktps.

So the question: what should I look for to diagnose/triage this problem? I've done some legwork already, but have no idea where to go next.

Looking through the logs, abort rates due to SSI aren't changing in any obvious way. I've been hacking on SSI for over a month now as part of a research project, and am fairly familiar with predicate.c, but I don't see any obvious reason this behavior should arise (in particular, SLRU storage seems to be re-initialized every time the postmaster restarts, so there shouldn't be any particular memory effect due to SIREAD locks). I'm also familiar with both Cahill's and Ports/Grittner's published descriptions of SSI, but again, nothing obvious jumps out.
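
(The abort counts come from grepping the server log for serialization failures, roughly as below; the log path is just wherever pg_ctl -l points:)

# count SSI aborts per run; this is the standard serialization failure message
grep -c "could not serialize access due to read/write dependencies" logfile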

Top reports only 50-60% CPU usage for most clients, and linux perf shows 5-10% of time going to various lwlock calls. Compiling with -DLWLOCK_STATS and comparing results for SI vs. SSI shows that the WALInsertLock (7) is white-hot in all cases, followed by almost equally white-hot SSI lwlocks (28,30,29) and the 16 PredicateLockMgrLocks (65-80). Other than the log bottleneck, this aligns precisely with previous results reported by others [1]; unfortunately, I don't see anything obvious in the lock stats to say why the problem is getting worse over time.
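
(For reference, the -DLWLOCK_STATS build was nothing fancy, roughly the following, with the install prefix being a placeholder; the stats get dumped to the server log as each backend exits:)

# same configure line as before, with lock statistics compiled in
./configure --prefix=$HOME/pgsql-9.3.4-lwstats CPPFLAGS="-DLWLOCK_STATS"
make -j48 && make install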

In my experience this sort of behavior indicates a bug, where fixing it could have a significant impact on performance (because the early "damage" is done so quickly after start-up that even the very first run doesn't live up to its true potential).

I also strongly suspect a bug because the SLRU storage (pg_serial) occasionally jumps from the 100-200kB range to 8GB. It's rare, but when it happens, performance tanks to tens of tps for the rest of the run, and the 8GB persists into subsequent runs. I saw some code comments in predicate.c suggesting that SLRU pages which fall out of range are not reclaimed until the next SLRU wraparound. Deleting pg_serial/* before each `pg_ctl start' seems to remove most of the problem (and, from my understanding of SSI, should be harmless, because no serialization conflicts can persist across a database shutdown).
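
Concretely, the workaround between runs looks like this (data directory path is a placeholder):

# check the SLRU size, then clear it while the server is down
du -sh $PGDATA/pg_serial            # normally ~100-200kB; occasionally 8GB
pg_ctl -D $PGDATA stop
rm -f $PGDATA/pg_serial/*           # only while the server is shut down
pg_ctl -D $PGDATA -l logfile start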

I tried to repro, and a series of 30-second runs gave the following throughputs (tps):
*4615
3155 3149 3115 3206 3162 3069 3005 2978 2953 **308
2871 2876 2838 2853 2817 2768 2736 2782 2732 2833
2749 2675 2771 2700 2675 2682 2647 2572 2626 2567
*4394

That ** entry was an 8GB blow-up. All files in the directory had been created at the same time (i.e. not during a previous run), and persisted through the runs that followed. There was also a run where abort rates jumped through the roof (~40k aborts rather than the usual 2000 or so), with a huge number of "out of shared memory" errors; apparently max_pred_locks_per_transaction=2000 wasn't high enough.
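
(Those errors are the usual predicate-lock exhaustion complaint, counted per run with something along these lines; the accompanying HINT in the log suggests raising max_pred_locks_per_transaction:)

# count predicate-lock exhaustion errors in the server log for one run
grep -c "out of shared memory" logfile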

$ cat pgsql.conf
shared_buffers = 8GB
synchronous_commit = off
checkpoint_segments = 64
max_pred_locks_per_transaction = 2000
default_statistics_target = 100
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
effective_cache_size = 40GB
work_mem = 1920MB
wal_buffers = 16MB


[1] http://www.postgresql.org/message-id/CA+TgmoYAiSM2jWEndReY5PL0sKbhgg7dbDH6r=oXKYzi9B7KJA@xxxxxxxxxxxxxx

Thoughts?
Ryan


