SSI slows down over time

Hi all,

(Referred here from pgsql-performance)

tl;dr: every time I shut the database down and bring it back up, SSI seems to get slower. To avoid thousands of SSI aborts due to running out of shared memory, I've had to set max_pred_locks_per_transaction to several thousand (2000 is tolerable, 8000 is needed to avoid all errors); this seems excessively high given how short TPC-C transactions are and how aggressively SSI reclaims storage. I've also found what appears to be a bug: the SSI SLRU storage (pg_serial) sometimes jumps from ~200kB to 8GB within a matter of seconds. The 8GB persists through later runs and seems to be part of the performance problem; deleting the contents of pg_serial after each database shutdown resolves most of it.
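
(For context, I've been spot-checking the live SIREAD lock count mid-run with something like the line below; the database name is whatever the benchmark created, 'tpcc' here is just a placeholder:)

# rough check on how many SIREAD (predicate) locks are live during a run
psql -d tpcc -c "SELECT count(*) FROM pg_locks WHERE mode = 'SIReadLock';"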

Excruciatingly long and detailed version follows...

This is with pgsql-9.3.4, x86_64-linux, home-built with `./configure --prefix=...' and gcc-4.7.
24-core Intel box with hyperthreading (so 48 contexts).
TPC-C courtesy of oltpbenchmark.com: 12 warehouses, 24 clients.

I get a strange behavior across repeated runs: each 100-second run is a bit slower than the one preceding it, when run with SSI (SERIALIZABLE). Switching to SI (REPEATABLE_READ) removes the problem, so it's apparently not due to the database growing. The database is completely shut down (pg_ctl stop) between runs, but the data lives in tmpfs, so there's no I/O problem here. 64GB RAM, so no paging, either.
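
For concreteness, each run is driven by roughly the following (paths are placeholders; $PGDATA sits on the tmpfs mount):

# per-run sequence; the database is fully shut down between runs
pg_ctl -D $PGDATA -l logfile start
# ... run the oltpbench TPC-C client for 100 seconds (12WH, 24 terminals) ...
pg_ctl -D $PGDATA stop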

Note that this slowdown is in addition to the 30% performance gap between SI and SSI on my 24-core machine. I understand that the latter is a known bottleneck [1]; my question is why the bottleneck should get worse over time:

With SI, I get ~4.4ktps, consistently.
With SSI, I get 3.9, 3.8, 3.4, 3.3, 3.1, 2.9, ... ktps.

So the question: what should I look for to diagnose/triage this problem? I've done some legwork already, but have no idea where to go next.

Looking through the logs, abort rates due to SSI aren't changing in any obvious way. I've been hacking on SSI for over a month now as part of a research project, and am fairly familiar with predicate.c, but I don't see any obvious reason this behavior should arise (in particular, SLRU storage seems to be re-initialized every time the postmaster restarts, so there shouldn't be any particular memory effect due to SIREAD locks). I'm also familiar with both Cahill's and Ports/Grittner's published descriptions of SSI, but again, nothing obvious jumps out.
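
(The abort counts come from grepping the server log for serialization failures, roughly as below; the log path is just wherever pg_ctl -l points:)

# count SSI aborts per run; this is the standard serialization failure message
grep -c "could not serialize access due to read/write dependencies" logfile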

Top reports only 50-60% CPU usage for most clients, and linux perf shows 5-10% of time going to various lwlock calls. Compiling with -DLWLOCK_STATS and comparing results for SI vs. SSI shows that the WALInsertLock (7) is white-hot in all cases, followed by almost equally white-hot SSI lwlocks (28,30,29) and the 16 PredicateLockMgrLocks (65-80). Other than the log bottleneck, this aligns precisely with previous results reported by others [1]; unfortunately, I don't see anything obvious in the lock stats to say why the problem is getting worse over time.
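
(For reference, the -DLWLOCK_STATS build was nothing fancy, roughly the following, with the install prefix being a placeholder; the stats get dumped to the server log as each backend exits:)

# same configure line as before, with lock statistics compiled in
./configure --prefix=$HOME/pgsql-9.3.4-lwstats CPPFLAGS="-DLWLOCK_STATS"
make -j48 && make install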

In my experience this sort of behavior indicates a bug, where fixing it could have a significant impact on performance (because the early "damage" is done so quickly after start-up that even the very first run doesn't live up to its true potential).

I also strongly suspect a bug because the SLRU storage (pg_serial) occasionally jumps from the 100-200kB range to 8GB. It's rare, but when it happens, performance tanks to tens of tps for the rest of the run, and the 8GB persists into subsequent runs. I saw some code comments in predicate.c suggesting that SLRU pages which fall out of range are not reclaimed until the next SLRU wraparound. Deleting pg_serial/* before each `pg_ctl start' seems to remove most of the problem (and, from my understanding of SSI, should be harmless, because no serialization conflicts can persist across a database shutdown).
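
Concretely, the workaround between runs looks like this (data directory path is a placeholder):

# check the SLRU size, then clear it while the server is down
du -sh $PGDATA/pg_serial            # normally ~100-200kB; occasionally 8GB
pg_ctl -D $PGDATA stop
rm -f $PGDATA/pg_serial/*           # only while the server is shut down
pg_ctl -D $PGDATA -l logfile start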

I tried to repro, and a series of 30-second runs gave the following throughputs (tps):
*4615
3155 3149 3115 3206 3162 3069 3005 2978 2953 **308
2871 2876 2838 2853 2817 2768 2736 2782 2732 2833
2749 2675 2771 2700 2675 2682 2647 2572 2626 2567
*4394

That ** entry was an 8GB blow-up. All files in the directory had been created at the same time (i.e. not during a previous run), and persisted through the runs that followed. There was also a run where abort rates jumped through the roof (~40k aborts rather than the usual 2000 or so), with a huge number of "out of shared memory" errors; apparently max_pred_locks_per_transaction=2000 wasn't high enough.
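
(Those errors are the usual predicate-lock exhaustion complaint, counted per run with something along these lines; the accompanying HINT in the log suggests raising max_pred_locks_per_transaction:)

# count predicate-lock exhaustion errors in the server log for one run
grep -c "out of shared memory" logfile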

$ cat pgsql.conf
shared_buffers = 8GB
synchronous_commit = off
checkpoint_segments = 64
max_pred_locks_per_transaction = 2000
default_statistics_target = 100
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
effective_cache_size = 40GB
work_mem = 1920MB
wal_buffers = 16MB


[1] http://www.postgresql.org/message-id/CA+TgmoYAiSM2jWEndReY5PL0sKbhgg7dbDH6r=oXKYzi9B7KJA@xxxxxxxxxxxxxx

Thoughts?
Ryan


