Gianni Ceccarelli <dakkar@xxxxxxxxxxxxxxx> wrote:

> At work we have a program that seems to be stressing the SSI
> implementation, and I thought that it could provide useful
> insights to better tune it. In particular, there are a few parts
> that are described as "chosen entirely arbitrarily (and without
> benchmarking)", and we may provide some of that benchmarking.

Any insights on where the current algorithms don't perform well
would be welcome. Of course, the subject line gives me some pause --
I'm aware of many uses of SSI in non-trivial production
environments, including multi-terabyte databases with thousands of
concurrent users. In some cases the performance hit compared to
REPEATABLE READ, which would allow anomalies, is about 1%.

> First of all, we're running "PostgreSQL 9.2.4 on
> x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704
> (Red Hat 4.1.2-52), 64-bit"
>
> The program consumes messages from a message bus (ActiveMQ in our
> case), and uses the data contained in them to update unstructured
> documents; some values from those documents are extracted into an
> attribute-value table to make it possible to search for them
> later.

Using SSI to manage a database queue is known to be a "worst case"
workload. Since there is no blocking beyond what is present in
snapshot isolation techniques (which is what you get at the
REPEATABLE READ level), all concurrent attempts will pull the same
item off the queue and attempt to process it, and all but one will
be rolled back with a serialization failure. Efficient management
of a queue in the database really requires blocking techniques to
actually serialize access to the ends of the queue.

That said, it sounds like you are not using SSI for that, but
leaving the queue management to ActiveMQ, which presumably handles
this correctly, and your problems are with access to the documents
based on the requests pulled from the queue. Is that correct?
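To illustrate the "worst case" point, here is a minimal sketch (plain Python, not PostgreSQL code, and a deliberate simplification of SSI) of why optimistic, non-blocking access to a queue head wastes work: every worker reads the same head item under the same snapshot, the first committer wins, and the rest fail as if rolled back with a serialization error.

```python
# Hypothetical model: N workers each take a snapshot of the queue head
# and try to "commit" a claim on it. First committer wins; the others
# see a stale snapshot and fail, mirroring all-but-one being rolled
# back with a serialization failure under SSI.

queue = ["job-1"]          # shared queue with a single head item
claimed_version = 0        # version counter standing in for commit order

def worker_attempt(seen_version):
    """Optimistic claim: succeeds only if nobody committed since our read."""
    global claimed_version
    if seen_version == claimed_version:   # our snapshot is still current
        claimed_version += 1              # commit our claim
        return True
    return False                          # serialization failure: must retry

# All workers take their snapshot before anyone commits.
snapshot = claimed_version
results = [worker_attempt(snapshot) for _ in range(5)]
print(results.count(True), results.count(False))  # 1 4
```

A blocking scheme (e.g. row locks) would instead make the second worker wait and then see the item already claimed, avoiding the wasted processing and retries.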
> The schema is essentially this::
>
>     CREATE TABLE docs (
>         id VARCHAR(255) PRIMARY KEY,
>         contents TEXT NOT NULL
>     );
>
>     CREATE TABLE doc_attributes (
>         document_id VARCHAR(255) NOT NULL REFERENCES docs(id)
>             ON DELETE CASCADE,
>         attribute_name VARCHAR(255) NOT NULL,
>         value VARCHAR(255) NOT NULL
>     );
>
>     CREATE INDEX idx_attribute_doc
>         ON doc_attributes(document_id);
>
>     CREATE INDEX idx_attribute_name_str
>         ON doc_attributes(attribute_name,value);
>
> The interesting part of the program works like this:
>
> * Figure out which documents to update::
>
>     BEGIN;
>     SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
>     SELECT id FROM docs WHERE ...;
>     COMMIT;
>
> * Update each of them in turn::
>
>     BEGIN;
>     SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
>     SELECT contents FROM docs WHERE id=?;
>     -- change the contents, in client code
>     UPDATE docs SET contents=? WHERE id=?;
>     DELETE FROM doc_attributes WHERE document_id=?;
>     INSERT INTO doc_attributes(document_id,attribute_name,value)
>         VALUES (?,?,?); -- for each attribute
>     COMMIT;

I'm afraid that one of the most interesting and pertinent bits of
the code is written as ellipses. Can you show or describe what the
READ COMMITTED transaction is doing in more detail?

> If we receive a serialisation error, we retry the whole
> transaction, applying the changes to the new version of the
> document. Each retry takes about 0.1 seconds.

What percentage of transactions are rolled back? What is the
processing time in such transactions as a percentage of the total
load?

> We have a few processes doing this in parallel, to keep up with
> the amount of messages that are sent. We have an average of 30
> rows in ``doc_attribute`` for each row in ``docs``.
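For reference, the retry behavior you describe can be sketched as a small client-side loop. This is a hedged illustration with a stubbed transaction function in place of real driver calls; the names SerializationFailure, retry_serializable, and update_document are hypothetical, but the shape (catch the serialization error, re-run the whole transaction against the new version of the document) matches what you describe.

```python
class SerializationFailure(Exception):
    """Stand-in for SQLSTATE 40001 raised by the database driver."""

def retry_serializable(txn_fn, max_attempts=10):
    """Re-run the whole transaction until it commits or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_fn()
        except SerializationFailure:
            continue  # re-read the new version of the document and retry
    raise RuntimeError("gave up after %d attempts" % max_attempts)

# Stubbed transaction: fails twice with a serialization error, then commits.
attempts = {"n": 0}
def update_document():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure
    return "committed"

print(retry_serializable(update_document), attempts["n"])  # committed 3
```

The rollback percentage I'm asking about is exactly the ratio of extra passes through this loop to total transactions.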
> This is a typical situation::
>
>     SELECT pid, locktype,
>            COUNT(*)/COUNT(DISTINCT virtualtransaction) AS tl,
>            COUNT(*) AS total
>       FROM pg_locks
>      WHERE mode LIKE 'SI%'
>      GROUP BY pid, locktype
>      ORDER BY pid, locktype;
>
>      pid  | locktype | tl  | total
>     ------+----------+-----+-------
>       445 | page     |   5 |  2706
>       445 | tuple    |   1 |   767
>       446 | page     |  14 |    28
>       446 | tuple    |  37 |    74
>       447 | page     |   1 |    19
>       448 | page     |   1 |    19
>       449 | page     |   5 |  2759
>       449 | tuple    |   1 |   758
>       454 | page     |  10 |  2209
>       454 | tuple    |  37 |  7663
>      1113 | page     |   5 |   604
>      1113 | tuple    |   4 |   531
>      1346 | page     |   6 |  1557
>      1346 | tuple    |   1 |   454
>           | page     | 174 |   174
>           | tuple    | 236 |   236
>     (16 rows)
>
> Due to the large number of predicate locks, we have
> ``max_pred_locks_per_transaction = 10000``, and ``max_connections
> = 300`` (this is probably going to be reduced, we don't need more
> than 100).
>
> Questions:
>
> - What are locks without a pid? I thought they were leftover from
>   transactions of now-disconnected clients, awaiting that all
>   overlapping transactions complete, but the numbers don't behave
>   as I would expect in that case (i.e. they don't grow when a
>   client disconnects)

Those are predicate locks related to a transaction which has been
PREPARED (for two-phase commit) or committed, but which may still
be relevant because there are overlapping read-write transactions
which are still active. One long-running SELECT statement, if it is
not declared to be in a read-only transaction, could cause a large
number of such locks to accumulate.

If you are not already doing so, make sure that any serializable
transaction which is not going to modify data is declared to be
READ ONLY at the start. I have seen some shops make that the
default and explicitly declare transactions to be READ WRITE when
that is needed.

> - Is the large number of page locks to be expected? How long
>   should we expect them to stay? Some seem to stay around for
>   minutes.
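The retention rule for those pid-less locks can be modeled in a few lines. This is a hypothetical sketch, not the server's actual data structures: a committed transaction's SIREAD locks are kept as long as any read-write transaction that overlapped it is still active, and are released once the last overlapping transaction finishes.

```python
# Toy model of SIREAD lock retention after commit. A committed
# transaction's locks survive (with no pid) while any transaction
# that overlapped it is still running, then get cleaned up.

active = set()     # currently running read-write transactions
retained = {}      # committed txn -> set of transactions it overlapped

def begin(txn):
    active.add(txn)

def commit(txn):
    active.discard(txn)
    overlaps = set(active)        # everyone still running overlapped us
    if overlaps:
        retained[txn] = overlaps  # locks survive, now with no pid
    # prune: drop locks whose overlapping transactions have all finished
    for done in [t for t, o in retained.items() if not (o & active)]:
        del retained[done]

begin("T1"); begin("T2")
commit("T1")                  # T2 still active -> T1's locks retained
print("T1" in retained)       # True
commit("T2")                  # last overlapping txn gone -> locks released
print("T1" in retained)       # False
```

This is why the counts don't track client disconnects directly: what matters is when the overlapping transactions complete, not when sessions go away.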
There is probably some database transaction (and that will count
individual statements not explicitly included in transactions)
which is running for minutes.

> - Can this be of any use to benchmarking / tuning the SSI logic?

As you probably noticed, the heuristics in
PredicateLockPromotionThreshold() are pretty simple. We promote to
a page lock when a process acquires its third tuple lock on a page,
and we promote to a relation lock when a process uses half of
max_pred_locks_per_transaction on a single relation. The thing is,
even those crude heuristics worked quite well in our testing, as
long as max_pred_locks_per_transaction and possibly max_connections
were adjusted. (It is something of a hack that it can be beneficial
to increase max_connections if you get a lot of page locks on a lot
of tables, but that rarely seems to be needed in practice -- and
you can always just go really high on
max_pred_locks_per_transaction instead.)

We have no illusions that these are the best possible heuristics,
but they have worked so well for the workloads we have tried that
there wasn't really a basis for testing an alternative. If you have
a workload which you think would do better with something more
sophisticated, it would be great to have more details. If you
wanted to benchmark your workload against a custom version with a
modified PredicateLockPromotionThreshold() function, that would be
fantastic.

There are many known ways in which the SSI logic could be tuned.
I've been meaning to itemize them on the TODO list, but will
mention the ones which come to mind here to get them "on the
record". If you have an interest in working on, or sponsoring
development of, any of these -- that would be great.
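For anyone wanting to experiment with the thresholds, here is a hedged sketch of the promotion heuristic as just described. It is a simplification in Python, not the actual C logic in PredicateLockPromotionThreshold(); the class name, the per-page bookkeeping, and the MAX_PRED_LOCKS_PER_XACT value are all assumptions for illustration.

```python
MAX_PRED_LOCKS_PER_XACT = 64   # assumed setting for the example

class TxnLocks:
    """Toy per-transaction predicate lock table with granularity promotion."""
    def __init__(self):
        self.tuple_locks = {}     # page -> count of tuple locks held
        self.page_locks = set()
        self.relation_lock = False
        self.total = 0

    def lock_tuple(self, page):
        if self.relation_lock or page in self.page_locks:
            return                          # already covered by a coarser lock
        n = self.tuple_locks.get(page, 0) + 1
        if n >= 3:                          # third tuple lock on this page:
            self.total -= self.tuple_locks.pop(page, 0)
            self.page_locks.add(page)       # promote to a single page lock
            self.total += 1
        else:
            self.tuple_locks[page] = n
            self.total += 1
        if self.total >= MAX_PRED_LOCKS_PER_XACT // 2:
            self.tuple_locks.clear()        # promote to a relation lock
            self.page_locks.clear()
            self.relation_lock = True
            self.total = 1

t = TxnLocks()
for _ in range(5):
    t.lock_tuple(page=1)
print(sorted(t.page_locks), t.tuple_locks)  # [1] {}
```

A custom build would change the "3" and the "half of max" ratio; comparing rollback rates across a few values on your workload is exactly the benchmarking that would be useful.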
- A recent university study of various techniques for achieving
  serializable transactions found the PostgreSQL implementation to
  scale better than any of the others they tested, even with
  retries on serialization failure; but at high concurrency they
  found about half the CPU time going to these functions, which are
  O(N^2) and should probably be reworked to scale better:

      CreatePredXact()
      ReleasePredXact()
      FirstPredXact()
      NextPredXact()

  Perhaps a hash table instead of a doubly linked list is needed.

- After the lightweight locking changes made in 9.2 to allow
  scaling to more processes, the lightweight locking in SSI
  predicate locking has become a bottleneck at higher concurrency.
  This area now needs tuning, and may or may not be related to the
  prior point.

- btree locking does not go down to the "next-key" granularity that
  most predicate locking systems use; its finest granularity is
  page level. We could reduce false positive serialization failures
  by implementing next-key locking granularity.

- Index types other than btree only support relation-level locks.
  Finer granularity would reduce false positives involving queries
  which select using other index types.

- We don't distinguish between heap relation locks which need to
  prohibit inserts (those caused by a table scan) and heap relation
  locks which don't conflict with inserts (those caused by
  promotion from finer granularity). We would reduce false
  positives if we did.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general